A random procedure generated the following sample (sequence of measurements):
You can download the data as a CSV file (for importing into a spreadsheet or R): list_counting_basic_data_92422.csv
First, it helps to sort the data.
You could use R to sort the data:
sort(c(50,57,58,51,51,59,58,59,51,56))
## [1] 50 51 51 51 56 57 58 58 59 59
You could also use a spreadsheet. Import the table into a spreadsheet. Then, highlight the column of measurements and use the Sort function.
A random procedure generated the following sample (sequence of measurements):
You can download the data as a CSV file (for importing into a spreadsheet or R): list_counting_between_data_64523.csv
First, it helps to sort the data.
You could use R to sort the data:
sort(c(25,24,27,22,23,21,25,27,26,26))
## [1] 21 22 23 24 25 25 26 26 27 27
You could also use a spreadsheet. Import the table into a spreadsheet. Then, highlight the column of measurements and use the Sort function.
A random procedure generated many measurements: list_counting_large_data_80983.csv
51.94, 51.97, 52.04, 50.64, 57.32, 58.67, 50.92, 58.9, 56.81, 56.44, 51.99, 52.43, 51.49, 53.19, 51.15, 59.5, 55.58, 51.9, 52.56, 54.75, 51.33, 56.55, 53.7, 53.22, 52.44, 52.52, 51.28, 56.02, 53.15, 54.86, 58.18, 50.58, 55.11, 53.01, 51.81, 56.05, 56.19, 52.59, 55.53, 55.89, 53.59, 50.79, 52.17, 52.52, 59.18, 54.09, 50.71, 58.73, 51.7, 50.63, 59.5, 55.11, 51.24, 55.48, 59.13, 53.15, 52.19, 55.39, 50.13, 57.45, 56.39, 57.42, 57.53, 59.61, 57.6, 57.89, 53.85, 55.16, 59.97, 54.53, 58.73, 58.25, 59.12, 59.36, 52.54, 52.11, 55.36, 54.33, 50.92, 52.52, 51.62, 51.52, 54.56, 50.76, 52.99, 53.52, 50.76, 59.99, 53.29, 59.25, 50.8, 53.69, 50.3, 55, 56.88, 51.45, 55.34, 57.45, 59.17, 58.17, 56.76, 55.17, 53.96, 52.15, 56.48, 53.95, 52.74, 50.83, 50.48, 57.19, 51.06, 55.58, 55.15, 50.17, 57.17, 56.01, 54.12, 57.79, 53.23, 52.47, 52.39, 57.12, 52.95, 55.31, 54.79, 57.55, 56.39, 51.55, 59.47, 50.82, 58.19, 59.4, 51.92, 59.18, 54.99, 59.94, 55.5, 58.72, 57.97, 52.26, 58.82, 56.7, 50.59, 50.78, 57.86, 55.18, 56.93, 56.84, 52.4, 56.51, 54.46, 50.79, 53.84, 54.69, 57.5, 50.33, 55.4, 54.9, 55.65, 53.88, 55.3, 54.11, 57.63, 54.19, 57.74, 50.04, 57.13, 50.27, 51.63, 54.32, 56.36, 57.35, 55.11, 58.8, 58.19
You will want to use a computer to answer these questions.
If you used a spreadsheet, you should end up with this solution csv.
To use R, the following commands would answer the questions.
x = c(51.94,51.97,52.04,50.64,57.32,58.67,50.92,58.9,56.81,56.44,51.99,52.43,51.49,53.19,51.15,59.5,55.58,51.9,52.56,54.75,51.33,56.55,53.7,53.22,52.44,52.52,51.28,56.02,53.15,54.86,58.18,50.58,55.11,53.01,51.81,56.05,56.19,52.59,55.53,55.89,53.59,50.79,52.17,52.52,59.18,54.09,50.71,58.73,51.7,50.63,59.5,55.11,51.24,55.48,59.13,53.15,52.19,55.39,50.13,57.45,56.39,57.42,57.53,59.61,57.6,57.89,53.85,55.16,59.97,54.53,58.73,58.25,59.12,59.36,52.54,52.11,55.36,54.33,50.92,52.52,51.62,51.52,54.56,50.76,52.99,53.52,50.76,59.99,53.29,59.25,50.8,53.69,50.3,55,56.88,51.45,55.34,57.45,59.17,58.17,56.76,55.17,53.96,52.15,56.48,53.95,52.74,50.83,50.48,57.19,51.06,55.58,55.15,50.17,57.17,56.01,54.12,57.79,53.23,52.47,52.39,57.12,52.95,55.31,54.79,57.55,56.39,51.55,59.47,50.82,58.19,59.4,51.92,59.18,54.99,59.94,55.5,58.72,57.97,52.26,58.82,56.7,50.59,50.78,57.86,55.18,56.93,56.84,52.4,56.51,54.46,50.79,53.84,54.69,57.5,50.33,55.4,54.9,55.65,53.88,55.3,54.11,57.63,54.19,57.74,50.04,57.13,50.27,51.63,54.32,56.36,57.35,55.11,58.8,58.19)
length(x)
## [1] 175
sum(x<57)
## [1] 129
sum(x>55.5)
## [1] 68
sum(abs(x-57)<3)
## [1] 99
sum(abs(x-56.2)>1.5)
## [1] 118
A random procedure generated many measurements: download data
Please complete the frequency distribution using breaks 70, 75, 80, 85, 90:
| Interval | Frequency |
|---|---|
| 70 to 75 | |
| 75 to 80 | |
| 80 to 85 | |
| 85 to 90 | |
You will want to use a computer to answer these questions.
In a spreadsheet, open the data, add the breaks as a column; then, use the FREQUENCY function.
In R, open the data and use the hist function. You supply the breaks and read the counts:
mydata = read.csv("make_freq_dist.csv")
x = mydata$x
myhist = hist(x,breaks=c(70,75,80,85,90))
myhist$counts
## [1] 48 12 5 5
| interval | frequency |
|---|---|
| 70 to 75 | 48 |
| 75 to 80 | 12 |
| 80 to 85 | 5 |
| 85 to 90 | 5 |
A random procedure generated 75 measurements, which were organized into the frequency distribution shown below. You can assume the measurements are of a continuous random variable, such that every measurement is in one of the intervals (and not on a break).
| interval | frequency |
|---|---|
| 55 to 60 | 7 |
| 60 to 65 | 13 |
| 65 to 70 | 21 |
| 70 to 75 | 12 |
| 75 to 80 | 12 |
| 80 to 85 | 10 |
The first 4 questions involve adding up frequencies of the indicated intervals. The last 4 questions can be done by guessing and checking until something works.
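The additions can also be done in R by storing the frequencies as a vector (a sketch; the frequencies below are copied from the table above, and the two example questions are made up):

```r
# Frequencies for the intervals 55-60, 60-65, 65-70, 70-75, 75-80, 80-85
freq = c(7, 13, 21, 12, 12, 10)
n = sum(freq)                 # total number of measurements (75)
below_70 = sum(freq[1:3])     # e.g., how many measurements are in 55 to 70
at_least_75 = sum(freq[5:6])  # e.g., how many measurements are in 75 to 85
n
below_70
at_least_75
```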
A random procedure generated 70 measurements, which were organized into the histogram shown below. You can assume the measurements are of a continuous variable, such that every measurement is in one of the intervals (and not on a break).
The first 4 questions involve adding up frequencies of the indicated intervals. The last 4 questions can be done by guessing and checking until something works.
A standard twelve-sided die was rolled 80 times, and the results were organized into the histogram shown below. In dice notation, we could say the results of 80d12 were plotted as a histogram. (Pedantic sidenote: it is common to interpret 80d12 as the SUM of 80 rolls, but we will interpret 80d12 as the LIST of 80 rolls and write sum(80d12) for the sum of 80 rolls.)
The first 4 questions involve adding up frequencies of the indicated intervals. The last 4 questions can be done by guessing and checking until something works.
A standard eight-sided die was rolled 100 times, and the results were organized into the pie chart shown below.
The first 4 questions involve adding up frequencies of the indicated intervals. The last 4 questions can be done by guessing and checking until something works.
A random procedure generated the following sample (sequence of measurements):
You can download the data as a CSV file (for importing into a spreadsheet or R): list_counting_basic_data_41468.csv
First, it helps to sort the data.
You could use R to sort the data:
sort(c(14,19,10,16,17,10,12,10,19,17))
## [1] 10 10 10 12 14 16 17 17 19 19
You could also use a spreadsheet. Import the table into a spreadsheet. Then, highlight the column of measurements and use the Sort function.
The lengths (in centimeters) of 10 lizards were recorded.
You can download the data as a CSV file (for importing into a spreadsheet or R): lizard_data.csv
First, it helps to sort the data.
You could use R to sort the data:
sort(c(97,98,94,93,98,96,99,92,95,99))
## [1] 92 93 94 95 96 97 98 98 99 99
You could also use a spreadsheet. Import the table into a spreadsheet. Then, highlight the column of measurements and use the Sort function.
Jordan is practicing free throws. They have recorded the results of many free throws.
## Hit Hit Hit Miss Hit Hit Hit Hit Hit Hit Hit Miss Miss Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Miss Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Miss Hit Miss Hit Hit Hit Hit Miss Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Miss Hit Hit Hit Hit Hit Hit Hit Miss Hit Hit Hit Hit Hit Hit Hit Miss Hit Hit Hit Miss Hit Hit Hit Hit Hit Miss Hit Miss Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Hit Miss Hit Hit Hit
You can download the data as a csv: basketball_proportion.csv. The header and first four rows are shown below.
| i | hit_or_miss |
|---|---|
| 1 | Hit |
| 2 | Hit |
| 3 | Hit |
| 4 | Miss |
I recommend using either R or a spreadsheet.
First, download the csv. Also, write the following script and save it as basketball_proportion.r. Put both files in the same directory (folder). Run the script.
mydata = read.csv("basketball_proportion.csv")
x = mydata$hit_or_miss
n = length(x)
ns = sum(x=="Hit")
nf = sum(x=="Miss")
phat = ns/n
qhat = nf/n
cat(sprintf("n=%d, ns=%d, nf=%d, phat=%.4f, qhat=%.4f",n,ns,nf,phat,qhat))
## n=106, ns=92, nf=14, phat=0.8679, qhat=0.1321
If you are using RStudio, you may need to click Session, Set Working Directory, Source File Location while the script (basketball_proportion.r) is the open tab.
First, if you scroll down, it should be clear there are 106 rows of data, because the last row has i = 106. In column C use IF(B2="Hit",1,0) and in column D use IF(B2="Miss",1,0), and extend the formulas down to get columns of 0s and 1s. Then use SUM(C2:C106) and SUM(D2:D106) to get ns and nf. You can divide these by n to determine phat and qhat.
You can see a solution CSV: proportion_solution.csv. Remember, you can hit ctrl+~ to see the formulas. You may need to enlarge a cell if it shows ###.
A random procedure generated measurements: download data
I’ve already determined the frequencies. Please determine the relative frequencies and the densities. A brief description of relative frequency and density can be found here.
| Interval | Frequency | Relative Frequency | Density |
|---|---|---|---|
| 30 to 32 | 4 | ||
| 32 to 34 | 5 | ||
| 34 to 36 | 9 | ||
| 36 to 38 | 43 | ||
| 38 to 40 | 69 | | |
To determine the relative frequencies, just divide each frequency by 130 (because n = 4 + 5 + 9 + 43 + 69 = 130). To determine the densities, divide the relative frequencies by the width of the interval, which in this case is the same for each interval (width 2).
| Interval | Frequency | Relative Frequency | Density |
|---|---|---|---|
| 30 to 32 | 4 | 0.03077 | 0.01538 |
| 32 to 34 | 5 | 0.03846 | 0.01923 |
| 34 to 36 | 9 | 0.06923 | 0.03462 |
| 36 to 38 | 43 | 0.3308 | 0.1654 |
| 38 to 40 | 69 | 0.5308 | 0.2654 |
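You can check this arithmetic in R (a sketch; the frequencies are copied from the table above):

```r
freq = c(4, 5, 9, 43, 69)  # frequencies from the table
n = sum(freq)              # 130 measurements in total
width = 2                  # every interval (e.g., 30 to 32) has width 2
relfreq = freq / n
density = relfreq / width
round(relfreq, 5)
round(density, 5)
```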
A random procedure generated 120 measurements, which were organized into the frequency distribution shown below. You can assume the measurements are of a continuous random variable, such that every measurement is in one of the intervals (and not on a break).
| interval | frequency | relative frequency | density |
|---|---|---|---|
| 60 to 65 | 51 | 0.425 | 0.085 |
| 65 to 70 | 10 | 0.08333 | 0.01667 |
| 70 to 75 | 5 | 0.04167 | 0.008333 |
| 75 to 80 | 3 | 0.025 | 0.005 |
| 80 to 85 | 13 | 0.1083 | 0.02167 |
| 85 to 90 | 38 | 0.3167 | 0.06333 |
The first 4 questions involve adding up relative frequencies of the indicated intervals. The last 4 questions can be done by guessing and checking until something works.
A random procedure generated 50 measurements, which were organized into the histogram shown below. You can assume the measurements are of a continuous variable, such that every measurement is in one of the intervals (and not on a break).
You may find it helpful to convert the densities to frequencies by multiplying each density by both the total sample size (n = 50) and the width of the bar (0.5).
The first 4 questions involve adding up frequencies of the indicated intervals. The last 4 questions can be done by guessing and checking until something works.
A standard eight-sided die was rolled 200 times, and the results were organized into the histogram shown below. In dice notation, we could say the results of 200d8 were plotted as a histogram. (Pedantic sidenote: it is common to interpret 200d8 as the SUM of 200 rolls, but we will interpret 200d8 as the LIST of 200 rolls and write sum(200d8) for the sum of 200 rolls.)
The first 4 questions involve adding up relative frequencies of the indicated intervals. The last 4 questions can be done by guessing and checking until something works.
You may also find it helpful to determine the frequencies (counts). To do this, multiply each relative frequency by the total number of measurements.
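For example, with hypothetical relative frequencies (not this problem's values):

```r
n = 200                        # total number of measurements
relfreq = c(0.10, 0.15, 0.25)  # hypothetical relative frequencies
counts = relfreq * n           # back out the frequencies (counts)
counts
```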
A standard six-sided die was rolled 100 times, and the results were organized into the pie chart shown below. The outside of the circle marks the cumulative proportion.
The first 4 questions involve adding up relative frequencies of the indicated intervals. The last 4 questions can be done by guessing and checking until something works.
Match the five histograms with their appropriate description.
This is definitional.
A sample of size 200 was taken from an unknown population.
4.08, 1.04, 1.72, 2.40, 9.14, 9.41, 1.14, 0.01, 9.07, 0.45,
7.26, 0.30, 9.74, 0.37, 4.29, 1.08, 1.57, 6.62, 7.01, 7.00,
1.22, 3.20, 0.27, 8.65, 6.00, 3.37, 9.94, 9.16, 9.98, 9.45,
7.44, 1.42, 7.74, 1.65, 5.59, 0.80, 2.92, 7.38, 1.03, 9.85,
5.09, 0.00, 3.83, 7.92, 0.80, 9.90, 9.38, 0.91, 0.26, 9.19,
5.98, 7.95, 7.94, 9.69, 0.19, 4.04, 2.78, 9.51, 8.92, 1.74,
9.50, 4.46, 0.54, 4.83, 0.71, 9.91, 2.69, 2.73, 5.21, 0.19,
9.80, 3.59, 0.38, 7.36, 3.52, 5.03, 3.58, 2.74, 9.99, 4.77,
3.30, 0.31, 2.16, 9.32, 3.06, 7.30, 1.58, 6.27, 3.40, 9.20,
8.85, 5.92, 0.07, 2.94, 1.15, 8.53, 0.86, 2.40, 0.30, 7.94,
0.02, 6.92, 0.39, 0.00, 8.74, 9.99, 0.54, 2.76, 0.75, 5.10,
2.66, 0.00, 8.68, 9.95, 0.00, 2.08, 6.74,10.00, 3.42, 5.92,
4.67, 0.65, 9.24, 9.18, 1.95, 5.18, 9.23, 8.16, 9.97, 4.84,
4.24, 0.00, 9.98, 6.62, 5.97, 4.89, 8.76, 1.10, 9.25, 1.34,
0.97, 0.92, 0.33, 5.43, 7.03, 3.93, 4.58, 7.51, 9.81, 9.28,
9.52, 7.48, 2.42, 9.85, 3.40, 2.44, 3.18, 0.59, 6.57, 0.80,
3.54, 3.07, 9.78, 4.58, 4.21, 9.65, 1.72, 6.36,10.00, 9.41,
7.72, 8.80, 1.44, 7.51, 0.96, 5.66, 9.92, 7.80, 4.88, 1.48,
7.26, 6.45, 7.27, 7.63, 0.30, 0.38, 7.11, 7.60, 6.08, 4.99,
1.22, 9.18, 0.37, 7.43, 9.92, 8.78, 9.92, 1.08, 9.97,10.00
You can download the data as a CSV. Determine which histogram visualizes the data, and describe the shape of the data.
You should make a histogram. This is easy in R.
x = read.csv("make_hist.csv")$x
hist(x)
Using a spreadsheet is way more work.
A sample was gathered.
You can download the data as a CSV file.
Determine x̄, the sample mean. Your answer can be rounded to the nearest tenth.
You need to sum the values (∑xᵢ; see summation) and divide by the sample size (n).
You can round to the nearest tenth: 34.1.
In a spreadsheet, you can use the AVERAGE function. You can see a solution spreadsheet here.
In R, you can use the mean() function.
# First, get the csv into the working directory... then...
data = read.csv("get_mean.csv")
x = data$x
xbar = mean(x)
round(xbar,1)
## [1] 34.1
A sample was gathered.
You can download the data as a CSV file.
Determine the sample median. Please enter an exact answer.
To determine the median by hand, you first sort the sample. If the sample size n is odd, just take the middle number (the ((n + 1)/2)th sorted value). If n is even, take the mean of the middle two numbers (the (n/2)th and (n/2 + 1)th sorted values).
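This hand method can be written as a short R function (a sketch; in practice you would just call the built-in median()):

```r
# Median "by hand": sort, then take the middle value (or average the middle two)
median_by_hand = function(x) {
  s = sort(x)
  n = length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                 # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2  # even n: mean of the two middle values
  }
}
median_by_hand(c(5, 1, 3))     # odd sample size: 3
median_by_hand(c(5, 1, 3, 7))  # even sample size: (3 + 5)/2 = 4
```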
In a spreadsheet, you can use the MEDIAN function. You can see a solution spreadsheet here.
In R, you can use the median() function.
# First, get the csv into the working directory... then...
data = read.csv("get_median.csv")
x = data$x
median(x)
## [1] 33.51
Depending on the type of distribution, we can make a strong claim regarding the mean and median.
A sample was gathered and visualized with a histogram.
What claim can you make regarding the mean and median?
The distribution is skewed left, so the mean is less than the median (mean < median).
A sample was gathered (from a Bernoulli random variable).
You can download the data as a CSV file.
Determine x̄, the sample mean. Actually, in this special case of 0s and 1s, the sample mean is called the sample proportion p̂ (“p hat”). So, determine the sample proportion. Your answer can be rounded to the nearest hundredth.
You need to sum the values (∑xᵢ; see summation) and divide by the sample size (n).
In the context of 0s and 1s, it is more appropriate to write p̂ instead of x̄.
Notice: the mean of 0s and 1s is the proportion of 1s. This is a good reason to use 0 for FALSE/“fail” and 1 for TRUE/“success”.
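A tiny R example of this fact (with made-up 0s and 1s):

```r
x = c(0, 1, 1, 0, 1)     # three 1s out of five values
mean(x)                  # 0.6
sum(x == 1) / length(x)  # the proportion of 1s: also 0.6
```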
In a spreadsheet, you can use the AVERAGE function. You can see a solution spreadsheet here.
In R, you can use the mean() function.
# First, get the csv into the working directory... then...
data = read.csv("get_mean_0s_1s.csv")
x = data$x
xbar = mean(x)
phat = xbar #because data is 0s and 1s
round(phat,2)
## [1] 0.33
A sample was gathered.
You can download the data as a CSV file.
Determine the sample range. Please enter an exact answer.
To determine the range by hand, you subtract the minimum value from the maximum value. You could first sort all the data. Determine the minimum. Determine the maximum. Take the difference.
In a spreadsheet, you can use the MIN function and MAX function.
You can see a solution spreadsheet here.
In R, you can use the min() and max() functions.
# First, get the csv into the working directory... then...
data = read.csv("get_range.csv")
x = data$x
range = max(x)-min(x)
print(range)
## [1] 5.68
A sample was gathered.
You can download the data as a CSV file.
Determine the sample’s mean absolute deviation (MAD) around the mean. You can round your answer to the hundredths place.
First, determine the mean of the sample (x̄ = 67.53).
Determine the absolute deviations (distances from x̄).
| i | xᵢ | deviation = xᵢ - x̄ | AbsDev = \|xᵢ - x̄\| |
|---|---|---|---|
| 1 | 72.65 | 5.12 | 5.12 |
| 2 | 67.56 | 0.03 | 0.03 |
| 3 | 71.47 | 3.94 | 3.94 |
| 4 | 68.35 | 0.82 | 0.82 |
| 5 | 64.66 | -2.87 | 2.87 |
| 6 | 63.62 | -3.91 | 3.91 |
| 7 | 61.59 | -5.94 | 5.94 |
| 8 | 74.02 | 6.49 | 6.49 |
| 9 | 66.97 | -0.56 | 0.56 |
| 10 | 65.00 | -2.53 | 2.53 |
| 11 | 65.44 | -2.09 | 2.09 |
| 12 | 69.15 | 1.62 | 1.62 |
| 13 | 60.46 | -7.07 | 7.07 |
| 14 | 74.48 | 6.95 | 6.95 |
Now, take the mean of the absolute deviations.
You can do this with a spreadsheet.
You can do this with R:
x = read.csv("get_MAD.csv")$x
xbar = mean(x)
deviations = x-xbar
AbsDev = abs(deviations)
MAD = mean(AbsDev)
x
## [1] 72.65 67.56 71.47 68.35 64.66 63.62 61.59 74.02 66.97 65.00 65.44
## [12] 69.15 60.46 74.48
xbar
## [1] 67.53
deviations
## [1] 5.12 0.03 3.94 0.82 -2.87 -3.91 -5.94 6.49 -0.56 -2.53 -2.09
## [12] 1.62 -7.07 6.95
AbsDev
## [1] 5.12 0.03 3.94 0.82 2.87 3.91 5.94 6.49 0.56 2.53 2.09 1.62 7.07
## [14] 6.95
MAD
## [1] 3.567143
A sample was gathered.
You can download the data as a CSV file.
Determine the biased sample variance (without Bessel correction). You can round your answer to the hundredths place.
First, determine the mean of the sample (x̄ = 65.89).
Determine the squared deviations (squared distances from x̄).
| i | xᵢ | deviation = xᵢ - x̄ | SqrDev = (xᵢ - x̄)² |
|---|---|---|---|
| 1 | 65.96 | 0.07 | 0.0049 |
| 2 | 68.45 | 2.56 | 6.5536 |
| 3 | 65.81 | -0.08 | 0.0064 |
| 4 | 69.73 | 3.84 | 14.7456 |
| 5 | 65.14 | -0.75 | 0.5625 |
| 6 | 67.83 | 1.94 | 3.7636 |
| 7 | 61.32 | -4.57 | 20.8849 |
| 8 | 66.27 | 0.38 | 0.1444 |
| 9 | 62.50 | -3.39 | 11.4921 |
Now, take the mean of the squared deviations.
You can do this with a spreadsheet.
And, actually, you can skip a lot of work by using the VAR.P function. Using the population variance formula is equivalent to using the biased sample variance formula.
You can do this with R:
x = read.csv("get_VAR.csv")$x
xbar = mean(x)
deviations = x-xbar
sqrdev = deviations^2
VAR = mean(sqrdev)
x
## [1] 65.96 68.45 65.81 69.73 65.14 67.83 61.32 66.27 62.50
xbar
## [1] 65.89
deviations
## [1] 0.07 2.56 -0.08 3.84 -0.75 1.94 -4.57 0.38 -3.39
sqrdev
## [1] 0.0049 6.5536 0.0064 14.7456 0.5625 3.7636 20.8849 0.1444
## [9] 11.4921
VAR
## [1] 6.462
# The built-in var() function almost works, but it is too fancy, and makes a Bessel correction. To use it, we need to undo the Bessel correction.
n = length(x)
var(x)*(n-1)/(n)
## [1] 6.462
A sample was gathered.
You can download the data as a CSV file.
Determine the unbiased sample variance (with Bessel correction). You can round your answer to the hundredths place.
You probably wonder why you would make the Bessel correction. The reason is important. We will see that the main goal of statistics is to infer the underlying probability distribution of a collection of empirical observations. In other words, we have a lottery machine filled with many balls (population), but we only see a small sample of those balls, and our goal is to guess what the population looks like based on a small sample (see statistical inference).
It turns out that when guessing the population’s variance from a small sample, your guess is better after making the Bessel correction.
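A quick simulation can illustrate this (a sketch; the normal population and small sample size here are arbitrary choices, not part of the problem):

```r
# Draw many small samples from a known population and compare the average
# of the biased vs. Bessel-corrected variance estimates.
set.seed(1)
sigma2 = 4    # true population variance (normal population with sd = 2)
n = 5         # small sample size
trials = 20000
biased = numeric(trials)
unbiased = numeric(trials)
for (i in 1:trials) {
  x = rnorm(n, mean = 50, sd = 2)
  xbar = mean(x)
  biased[i] = sum((x - xbar)^2) / n          # no correction
  unbiased[i] = sum((x - xbar)^2) / (n - 1)  # Bessel correction
}
mean(biased)    # systematically below the true variance of 4
mean(unbiased)  # close to 4
```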
First, determine the mean of the sample (x̄ = 54.38).
Determine the squared deviations (squared distances from x̄).
| i | xᵢ | deviation = xᵢ - x̄ | SqrDev = (xᵢ - x̄)² |
|---|---|---|---|
| 1 | 54.13 | -0.25 | 0.0625 |
| 2 | 54.72 | 0.34 | 0.1156 |
| 3 | 53.93 | -0.45 | 0.2025 |
| 4 | 54.16 | -0.22 | 0.0484 |
| 5 | 55.00 | 0.62 | 0.3844 |
| 6 | 53.81 | -0.57 | 0.3249 |
| 7 | 54.50 | 0.12 | 0.0144 |
| 8 | 54.79 | 0.41 | 0.1681 |
Determine the unbiased sample variance by summing the squared deviations and dividing the sum by n − 1.
You can do this with a spreadsheet.
And, actually, you can skip a lot of work by using the VAR function.
You can do this with R:
x = read.csv("get_VAR.csv")$x
var(x)
## [1] 0.1886857
You could also do it the long way.
x = read.csv("get_VAR.csv")$x
n = length(x)
xbar = mean(x)
deviations = x-xbar
sqrdev = deviations^2
VAR = sum(sqrdev)/(n-1)
VAR
## [1] 0.1886857
A sample was gathered.
You can download the data as a CSV file.
Determine the uncorrected sample standard deviation (without Bessel correction, biased).
You can round your answer to the hundredths place.
Determine the mean of the sample (x̄ = 52.21).
Determine the squared deviations (squared distances from x̄).
| i | xᵢ | deviation = xᵢ - x̄ | SqrDev = (xᵢ - x̄)² |
|---|---|---|---|
| 1 | 50.07 | -2.14 | 4.5796 |
| 2 | 52.02 | -0.19 | 0.0361 |
| 3 | 56.29 | 4.08 | 16.6464 |
| 4 | 53.73 | 1.52 | 2.3104 |
| 5 | 50.38 | -1.83 | 3.3489 |
| 6 | 53.85 | 1.64 | 2.6896 |
| 7 | 50.51 | -1.70 | 2.8900 |
| 8 | 53.98 | 1.77 | 3.1329 |
| 9 | 51.71 | -0.50 | 0.2500 |
| 10 | 50.64 | -1.57 | 2.4649 |
| 11 | 50.05 | -2.16 | 4.6656 |
| 12 | 53.29 | 1.08 | 1.1664 |
Find the mean of the squared deviations.
Take the square root of the variance.
You can do this with a spreadsheet.
And, actually, you can skip a lot of work by using the STDEV.P function. Using the population standard deviation formula is equivalent to using the biased sample standard deviation formula.
You can do this with R:
x = read.csv("get_SD.csv")$x
xbar = mean(x)
deviations = x-xbar
sqrdev = deviations^2
VAR_biased = mean(sqrdev)
SD_biased = sqrt(VAR_biased)
SD_biased
## [1] 1.918784
# This could also be done with a one-liner
x = read.csv("get_SD.csv")$x
SD_biased2 = sqrt(mean((x-mean(x))^2))
SD_biased2
## [1] 1.918784
# The built-in sd() function almost works, but it is too fancy, and makes a Bessel correction. To use it, we need to undo the Bessel correction.
x = read.csv("get_SD.csv")$x
n = length(x)
SD_biased3 = sd(x)*sqrt((n-1)/(n))
SD_biased3
## [1] 1.918784
A sample was gathered.
You can download the data as a CSV file.
Determine the corrected sample standard deviation (with Bessel correction). You can round your answer to the hundredths place.
You probably wonder why you would make the Bessel correction. The reason is important. We will see that the main goal of statistics is to infer the underlying probability distribution of a collection of empirical observations. In other words, we have a lottery machine filled with many balls (population), but we only see a small sample of those balls, and our goal is to guess what the population looks like based on a small sample (see statistical inference).
It turns out that when guessing the population’s standard deviation from a small sample, your guess is better after making the Bessel correction.
First, determine the mean of the sample (x̄ = 31).
Determine the squared deviations (squared distances from x̄).
| i | xᵢ | deviation = xᵢ - x̄ | SqrDev = (xᵢ - x̄)² |
|---|---|---|---|
| 1 | 20 | -11 | 121 |
| 2 | 31 | 0 | 0 |
| 3 | 43 | 12 | 144 |
| 4 | 39 | 8 | 64 |
| 5 | 45 | 14 | 196 |
| 6 | 21 | -10 | 100 |
| 7 | 37 | 6 | 36 |
| 8 | 20 | -11 | 121 |
| 9 | 20 | -11 | 121 |
| 10 | 45 | 14 | 196 |
| 11 | 20 | -11 | 121 |
Determine the unbiased sample variance by summing the squared deviations and dividing the sum by n − 1.
Determine the corrected sample standard deviation by taking the square root of the unbiased sample variance.
You can do this with a spreadsheet.
And, actually, you can skip a lot of work by using the STDEV function.
You can do this with R:
x = read.csv("get_SD.csv")$x
sd(x)
## [1] 11.04536
You could also do it the long way.
x = read.csv("get_SD.csv")$x
n = length(x)
xbar = mean(x)
deviations = x-xbar
sqrdev = deviations^2
VAR = sum(sqrdev)/(n-1)
SD = sqrt(VAR)
SD
## [1] 11.04536
A large sample was gathered and visualized with a histogram.
When a symmetric distribution is sampled thoroughly, you can use the following rules of thumb to estimate the mean and standard deviation.
| Shape | Estimated mean | Estimated standard deviation |
|---|---|---|
| Bell | midpoint of the range | range/6 |
| Uniform | midpoint of the range | range/3.5 (that is, range/√12) |
| Bimodal | midpoint of the range | range/2 |
Notice the bell has the smallest standard deviation for a given range. This is because many measurements are near the mean, and just a few are near the edges.
The bimodal has the largest standard deviation for a given range. This is because many measurements are near the edges, and just a few are near the mean.
The uniform distribution has about equal numbers of measurements near the mean and near the edges, so it has a standard deviation between the two other shapes (for a given range).
You’ll notice the bimodal estimate is probably the worst estimate. The estimate would be more accurate if none of the values were near the mean, and all the values were near the edge.
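A simulation sketch (with arbitrary made-up samples, all spanning roughly 0 to 10) illustrates this ordering of the standard deviations:

```r
set.seed(2)
N = 100000
# Three samples with roughly the same range (0 to 10) but different shapes
bell = pmin(pmax(rnorm(N, mean = 5, sd = 10/6), 0), 10)  # bell shape
unif = runif(N, min = 0, max = 10)                       # uniform shape
bimodal = sample(c(0, 10), N, replace = TRUE)            # extreme bimodal shape
sd(bell)     # smallest
sd(unif)     # in between
sd(bimodal)  # largest, about half the range
```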
Three large samples were taken from three different populations. Their distributions are shown as histograms.
All three distributions look like they have a similar uniform shape, but their centers and spreads are all different.
Three large samples were taken from three different populations. Their distributions are shown as histograms.
All three distributions look like they have similar ranges (widths), but different shapes. So, we will use the fact that, for a given range, the bell shape has the smallest standard deviation and the bimodal shape has the largest. This is because a bell shape has many measurements near the middle, whereas the bimodal shape has many measurements near the edges of its interval.
Sometimes a population is well characterized. In this case we know its mean and standard deviation. We use the Greek letters μ (“mu”) and σ (“sigma”) when describing the population (instead of the x̄ (“xbar”) and s used for the sample mean and sample standard deviation).
When measuring individuals from a population, we expect most measurements to be within the interval of typical measurements. We will define the interval of typical measurements (using interval notation) as (μ − 2σ, μ + 2σ). In other words, we expect most measurements to be within two standard deviations of the mean.
A population of lizards has a mean length of cm and a standard deviation of cm. Determine the interval of typical measurements.
You need to use the formulas μ − 2σ and μ + 2σ. Remember your order of operations!
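As a sketch in R (using the interval μ ± 2σ, with made-up values for μ and σ, since the problem's specific numbers are not shown here):

```r
# Hypothetical population values (not this problem's numbers)
mu = 20                  # population mean (cm)
sigma = 3                # population standard deviation (cm)
lower = mu - 2 * sigma   # lower bound of typical measurements
upper = mu + 2 * sigma   # upper bound of typical measurements
c(lower, upper)
```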
The following spinner has a population mean μ = 65 and a population standard deviation σ = 2. We can think of a spinner as an infinite population from which we can take as many independent measurements as we want.
A sample (200 measurements) was taken.
65.5914, 69.6591, 62.3152, 62.0706, 64.2983, 66.9836, 67.3504, 64.5599, 64.4641, 65.8275,
62.4167, 66.1170, 63.8299, 61.2904, 67.3588, 63.3710, 65.2219, 66.6059, 62.7939, 61.4353,
68.0023, 65.2899, 67.3512, 65.9362, 65.2307, 68.4660, 65.6860, 66.3906, 65.8748, 62.7641,
62.5626, 67.4819, 67.3861, 65.0441, 63.3612, 67.0488, 66.5450, 65.0290, 64.0181, 66.5035,
66.9666, 64.7889, 66.4093, 66.8305, 61.2497, 63.8005, 62.5556, 61.5403, 65.5833, 64.6528,
66.3852, 63.9360, 67.2303, 62.4962, 64.8687, 67.8652, 62.9088, 68.9563, 62.7787, 64.8576,
63.8498, 63.5096, 65.7014, 66.0777, 63.1418, 64.9178, 65.6313, 66.4512, 68.1289, 67.5895,
65.9020, 66.0343, 64.1611, 68.5267, 65.0194, 63.8384, 63.3480, 65.3929, 64.9688, 63.9628,
66.1697, 66.4425, 66.1182, 63.8994, 64.8656, 59.4777, 64.8277, 64.5477, 62.8686, 64.6629,
64.1792, 62.8685, 65.2674, 65.4781, 62.3302, 63.9020, 65.5931, 63.7468, 70.1651, 67.6230,
62.7863, 59.7476, 63.8277, 62.8221, 64.5745, 62.7795, 65.4342, 65.2129, 65.8897, 66.0036,
62.6035, 65.4064, 65.1463, 66.0459, 61.7113, 63.5175, 65.6731, 68.7174, 66.9093, 67.2812,
70.8647, 67.4793, 68.6700, 68.3615, 63.8713, 65.2969, 66.6215, 61.9097, 64.6685, 63.6505,
66.3863, 63.7009, 64.4872, 67.5388, 68.6062, 65.5804, 64.6326, 68.7246, 71.2573, 67.8468,
66.1906, 66.2312, 65.7185, 64.4274, 67.3468, 62.1675, 64.7659, 61.8738, 67.8193, 65.1262,
68.0571, 63.0963, 62.6560, 61.7648, 66.6956, 69.2242, 67.5693, 65.5108, 61.6663, 65.0250,
63.8676, 67.8875, 65.0600, 65.7887, 64.6236, 64.1440, 65.4003, 65.6019, 65.6642, 64.6577,
67.6506, 64.4485, 65.4193, 68.8963, 66.0985, 64.4691, 62.0262, 66.9004, 63.8215, 64.4759,
65.3226, 61.4894, 68.3864, 68.1937, 63.7190, 69.1519, 64.8021, 69.4722, 64.5362, 64.1693,
61.4202, 64.7983, 63.6826, 62.4361, 63.5248, 61.8594, 65.0435, 65.7796, 61.2055, 62.0183
You can download the data as a CSV file.
What proportion of the 200 measurements are outside the interval of typical measurements?
First, determine the interval of typical measurements.
Now, determine how many measurements (and divide by 200 for what proportion) are either less than 61 or more than 69. You’ll want to use a computer.
x = read.csv("check_interval_typical_measurements.csv")$x
n = length(x)
count_outside = sum(x<61 | x>69)
prop_outside = count_outside/n
print(prop_outside)
## [1] 0.045
In R, the “|” operator means “or”.
You could also write the inequality as an absolute deviation from the mean. Any measurement more than 4 units from 65 (in either direction) would be outside the interval.
x = read.csv("check_interval_typical_measurements.csv")$x
n = length(x)
count_outside = sum( abs(x-65)>4 )
prop_outside = count_outside/n
print(prop_outside)
## [1] 0.045
In a spreadsheet you can use the IF function along with the OR function to determine which measurements are under 61 or over 69. You then use the SUM function to count the 1s.
You can download this solution CSV.
Another (simpler?) way is to use the COUNTIF function with the ABS function.
You can download this second solution CSV.
A sample was gathered.
You can download the data as a CSV file.
Determine the sample interquartile range (IQR).
Warning: various definitions of IQR exist, based on arbitrary decisions made in defining the quantile function or other definitions of quartiles. I will make the answer’s tolerance large enough to accept most (hopefully all) methods.
This method is described in the Wikipedia page on the IQR.
This method relies on first determining the size of each half.
You determine the medians of the lowest half of the values and the highest half of the values. The IQR is the difference of those medians.
In this case, n = 7, so Q1 is the median of the lowest 3 numbers and Q3 is the median of the highest 3 numbers.
Method 1 is easiest to do by hand.
Because there are 7 values, the first quartile is the median of the lowest 3 values and the third quartile is the median of the highest 3 values.
The IQR is the difference between Q3 and Q1.
Unfortunately, the built-in QUARTILE function does not use the method of medians (more about this in Method 2).
We sort the data, determine the median of the lowest 3 values, determine the median of the highest 3 values, and take a difference.
You can see a solution spreadsheet.
Again, the built-in function does not follow the method of medians. So, Method 1 is actually kind of difficult with R. The following code should be relatively easy to understand… but it uses floor rounding, subsetting and the colon operator.
data = read.csv("get_IQR.csv")
x = data$x
n = length(x)
x_sorted = sort(x)
halfsize = floor(n/2)
Q1 = median(x_sorted[1:halfsize])
Q3 = median(x_sorted[(n-halfsize+1):n])
iqr = Q3-Q1
print(iqr)
## [1] 14.33
We can find the quartiles with built-in functions of a spreadsheet.
You’ll notice this gives a different answer than Method 1.
There are 9 different built-in methods in R.
# Personally, I like the 5th option... the default is 7... some smart researchers suggest 8...
x = read.csv("get_IQR.csv")$x
IQR1 = IQR(x,type=1)
IQR2 = IQR(x,type=2)
IQR3 = IQR(x,type=3)
IQR4 = IQR(x,type=4)
IQR5 = IQR(x,type=5)
IQR6 = IQR(x,type=6)
IQR7 = IQR(x,type=7)
IQR8 = IQR(x,type=8)
IQR9 = IQR(x,type=9)
cat(c(IQR1,IQR2,IQR3,IQR4,IQR5,IQR6,IQR7,IQR8,IQR9))
## 14.33 14.33 14.22 14.4025 12.165 14.33 10 12.88667 12.70625
#The default is type 7...
x = read.csv("get_IQR.csv")$x
IQR_default = IQR(x)
IQR_default
## [1] 10
To really understand what is happening, I think it helps to visualize the quantile functions. Let’s use types 1, 5, and 7. Also, remember the sorted sample:
## 30.54 31.16 39.71 40.32 45.38 45.49 49.89
Type 1 is based on the empirical cumulative distribution.
Types 5 and 7 are based on continuous versions of the empirical cumulative distribution.
Any of the following answers are accepted:
## 14.33 10 14.33 14.33 14.22 14.4025 12.165 14.33 10 12.88667 12.70625
A sample was gathered.
You can download the data as a CSV file.
Determine the sample’s mean absolute deviation (MAD) around the sample proportion. You can round your answer to the hundredths place.
First, determine the sample proportion (the mean of the 0s and 1s). In this case, 4 of the 8 values are 1s, so the sample proportion is 4/8 = 0.5.
Determine the absolute deviations (distances from the sample proportion).
| i | x | deviations = x − phat | AbsDev = \|deviations\| |
|---|---|---|---|
| 1 | 0 | -0.5 | 0.5 |
| 2 | 0 | -0.5 | 0.5 |
| 3 | 1 | 0.5 | 0.5 |
| 4 | 1 | 0.5 | 0.5 |
| 5 | 1 | 0.5 | 0.5 |
| 6 | 1 | 0.5 | 0.5 |
| 7 | 0 | -0.5 | 0.5 |
| 8 | 0 | -0.5 | 0.5 |
Now, take the mean of the absolute deviations.
You can do this with a spreadsheet.
You can do this with R
x = read.csv("get_MAD.csv")$x
phat = mean(x)
deviations = x-phat
AbsDev = abs(deviations)
MAD = mean(AbsDev)
x
## [1] 0 0 1 1 1 1 0 0
phat
## [1] 0.5
deviations
## [1] -0.5 -0.5 0.5 0.5 0.5 0.5 -0.5 -0.5
AbsDev
## [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
MAD
## [1] 0.5
With a little bit of algebra, we can simplify the formula for MAD. (I’ve not shown the algebra…)
In other words, if n0 is the number of 0s and n1 is the number of 1s, then MAD = 2·n0·n1/n². So, in this case: MAD = 2(4)(4)/8² = 0.5.
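As a quick check in R, the direct definition of MAD and the 0s-and-1s shortcut agree on the sample above (the vector is copied from the table):

```r
# Compare the direct definition of MAD with the shortcut 2*n0*n1/n^2.
x = c(0, 0, 1, 1, 1, 1, 0, 0)         # the sample from the table above
n = length(x)
n1 = sum(x)                           # number of 1s
n0 = n - n1                           # number of 0s
MAD_direct = mean(abs(x - mean(x)))   # mean absolute deviation around phat
MAD_shortcut = 2 * n0 * n1 / n^2      # shortcut formula
MAD_direct                            # 0.5
MAD_shortcut                          # 0.5
```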
A sample was gathered.
You can download the data as a CSV file.
Determine the variance of the sample. You can round your final answer to the hundredths place.
First, determine the mean of the sample. Remember, if a sample is all 0s and 1s, then the sample’s mean is the same as the sample proportion (phat).
Determine the squared deviations (squared distances from the sample proportion).
| i | w | deviations = w − phat | SqrDev = deviations² |
|---|---|---|---|
| 1 | 0 | -0.3 | 0.09 |
| 2 | 0 | -0.3 | 0.09 |
| 3 | 0 | -0.3 | 0.09 |
| 4 | 0 | -0.3 | 0.09 |
| 5 | 0 | -0.3 | 0.09 |
| 6 | 1 | 0.7 | 0.49 |
| 7 | 0 | -0.3 | 0.09 |
| 8 | 1 | 0.7 | 0.49 |
| 9 | 0 | -0.3 | 0.09 |
| 10 | 1 | 0.7 | 0.49 |
Now, take the mean of the squared deviations to determine the variance.
I will usually use w (not conventional) to represent raw data of 0s and 1s (instead of x), because in the context of 0s and 1s, x usually implies the number of successes in n trials. When x represents a count of successes in n (independent) trials, a large sample of counts follows a binomial distribution, which is a special case of more general distributions of sums or means, which tend to be normally distributed.
The Central Limit Theorem states that random averages (means) and random sums follow normal probability distributions. The expected value and standard deviation of the sampling distribution are either calculated from the underlying distribution’s parameters or guessed from a sample’s statistics.
Notice, when dealing with binomial distributions, the underlying Bernoulli distribution is rarely discussed, and even when it is, “w” is not used. So, this notation is not conventional. The conventional notation for the sample proportion is p-hat (written phat in the code below), where the hat denotes “estimated”, because the sample proportion is an estimate of the underlying population proportion p.
You can do this with a spreadsheet.
And, actually, you can skip a lot of work by using the VAR.P function. Notice, you use the population function even though the data is a sample. This is because, with proportions, the mean and variance are intrinsically linked (not independent).
In fact, we will see that the sample variance of 0s and 1s can be calculated from the sample proportion.
You can do this with R
w = read.csv("get_VAR.csv")$w
phat = mean(w)
deviations = w-phat
sqrdev = deviations^2
VAR = mean(sqrdev)
w
## [1] 0 0 0 0 0 1 0 1 0 1
phat
## [1] 0.3
deviations
## [1] -0.3 -0.3 -0.3 -0.3 -0.3 0.7 -0.3 0.7 -0.3 0.7
sqrdev
## [1] 0.09 0.09 0.09 0.09 0.09 0.49 0.09 0.49 0.09 0.49
VAR
## [1] 0.21
# The built-in var() function almost works, but it is too fancy, and makes a Bessel correction. To use it, we need to undo the Bessel correction.
n = length(w)
var(w)*(n-1)/(n)
## [1] 0.21
As mentioned earlier, there is an intrinsic link between the sample proportion (the mean of the 0s and 1s) and the variance of the 0s and 1s.
Let n0 represent the number of 0s and n1 represent the number of 1s. We can find a simple formula for the variance (when the data is 0s and 1s [Bernoulli]): VAR = phat(1 − phat) = n0·n1/n².
So, you could have just found phat = 0.3. Then used the formula: VAR = 0.3(1 − 0.3) = 0.21.
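A quick check in R that the direct definition and the shortcut formula agree on the sample above:

```r
# Compare the direct definition of the variance with the shortcut phat*(1-phat).
w = c(0, 0, 0, 0, 0, 1, 0, 1, 0, 1)   # the sample from the table above
phat = mean(w)                         # 0.3
VAR_direct = mean((w - phat)^2)        # mean of the squared deviations
VAR_shortcut = phat * (1 - phat)       # shortcut formula
VAR_direct                             # 0.21
VAR_shortcut                           # 0.21
```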
A sample was gathered.
You can download the data as a CSV file.
Determine the standard deviation of the sample. Because the data are all 0s and 1s, you would never make Bessel’s correction (even if guessing the population’s standard deviation), because the mean and standard deviation are not independent parameters. You can round your final answer to the hundredths place.
First, determine the mean of the sample. Remember, if a sample is all 0s and 1s, then the sample’s mean is the same as the sample proportion (phat).
Determine the squared deviations (squared distances from the sample proportion).
| i | w | deviations = w − phat | SqrDev = deviations² |
|---|---|---|---|
| 1 | 1 | 0.625 | 0.390625 |
| 2 | 1 | 0.625 | 0.390625 |
| 3 | 0 | -0.375 | 0.140625 |
| 4 | 1 | 0.625 | 0.390625 |
| 5 | 1 | 0.625 | 0.390625 |
| 6 | 0 | -0.375 | 0.140625 |
| 7 | 0 | -0.375 | 0.140625 |
| 8 | 1 | 0.625 | 0.390625 |
| 9 | 0 | -0.375 | 0.140625 |
| 10 | 1 | 0.625 | 0.390625 |
| 11 | 0 | -0.375 | 0.140625 |
| 12 | 0 | -0.375 | 0.140625 |
| 13 | 0 | -0.375 | 0.140625 |
| 14 | 0 | -0.375 | 0.140625 |
| 15 | 0 | -0.375 | 0.140625 |
| 16 | 0 | -0.375 | 0.140625 |
Now, take the mean of the squared deviations to determine the variance.
The standard deviation is the square root of the variance.
I will usually use w (not conventional) to represent raw data of 0s and 1s (instead of x), because in the context of 0s and 1s, x usually implies the number of successes in n trials. When x represents a count of successes in n (independent) trials, a large sample of counts follows a binomial distribution, which is a special case of more general distributions of sums or means, which tend to be normally distributed.
The Central Limit Theorem states that random averages (means) and random sums follow normal probability distributions. The expected value and standard deviation of the sampling distribution are either calculated from the underlying distribution’s parameters or guessed from a sample’s statistics.
Notice, when dealing with binomial distributions, the underlying Bernoulli distribution is rarely discussed, and even when it is, “w” is not used. So, this notation is not conventional. The conventional notation for the sample proportion is p-hat (written phat in the code below), where the hat denotes “estimated”, because the sample proportion is an estimate of the underlying population proportion p.
You can do this with a spreadsheet.
And, actually, you can skip a lot of work by using the STDEV.P function. Notice, you use the population function even though the data is a sample. This is because, with proportions, the mean and standard deviation are intrinsically linked (not independent).
In fact, we will see that the sample standard deviation of 0s and 1s can be calculated from the sample proportion.
You can do this with R
w = read.csv("get_SD.csv")$w
phat = mean(w)
deviations = w-phat
sqrdev = deviations^2
VAR = mean(sqrdev)
SD = sqrt(VAR)
w
## [1] 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 0
phat
## [1] 0.375
deviations
## [1] 0.625 0.625 -0.375 0.625 0.625 -0.375 -0.375 0.625 -0.375
## [10] 0.625 -0.375 -0.375 -0.375 -0.375 -0.375 -0.375
sqrdev
## [1] 0.390625 0.390625 0.140625 0.390625 0.390625 0.140625 0.140625
## [8] 0.390625 0.140625 0.390625 0.140625 0.140625 0.140625 0.140625
## [15] 0.140625 0.140625
VAR
## [1] 0.234375
SD
## [1] 0.4841229
# The built-in sd() function almost works, but it is too fancy, and makes a Bessel correction. To use it, we need to undo the Bessel correction.
n = length(w)
sd(w)*sqrt((n-1)/(n))
## [1] 0.4841229
As mentioned earlier, there is an intrinsic link between the sample proportion (the mean of the 0s and 1s) and the standard deviation of the 0s and 1s.
Let n0 represent the number of 0s and n1 represent the number of 1s. We can find a simple formula for the standard deviation (when the data is 0s and 1s [Bernoulli]): SD = sqrt(phat(1 − phat)) = sqrt(n0·n1/n²).
So, you could have just found phat = 0.375. Then used the formula: SD = sqrt(0.375 × 0.625) ≈ 0.48.
A sample was taken from an unknown population. The values were organized into a boxplot.
For simplicity, assume no measurements lie on the hinges, median, or whisker tips (so we do not worry about inclusive vs. exclusive boundaries). This assumption is approximately true with a very large sample from a continuous distribution.
You need to know that each region (whisker or half-box) contains 25% of the measurements.
Five different populations were sampled, and the measurements were visualized as five boxplots. (Note: typical boxplots indicate outliers with dots. For simplicity, these boxplots include all outliers in the whiskers.)
Match the five boxplots with their appropriate description.
A continuous random variable (spinner/random number generator/infinite population) can be visualized with a density curve, a spinner, and a cumulative curve.
For each problem, you can use any of the visualizations. In short, the answers:
## 0.5 0.52 0.64 66 74 2
In statistics, the word “normal” does not mean “typical”. Instead, “normal” refers to a very important continuous distribution: the normal distribution. Normal distributions are important because random sums and random averages are approximately normal (see central limit theorem). For example, if you repeatedly roll 100 dice, taking the sum of each 100, those sums will be normally distributed (even though single rolls are discrete-uniformly distributed).
A normal distribution has a bell-shaped density curve. The center and spread of the bell are dictated by two parameters: the mean (μ) and the standard deviation (σ). The normal density curve is defined by the following equation: f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)), where μ is the mean, σ is the standard deviation, π is the ratio of a circle’s circumference to its diameter, and e is Euler’s number.
For the first exam, we are expected to know the empirical rule. We are going to round unconventionally, so for us it could be called the “68-96-100 rule”. It implies that in a normal distribution 68% of the measurements are within 1 standard deviation of the mean, 96% of the measurements are within 2 standard deviations of the mean, and 100% of the measurements are within 3 standard deviations of the mean.
We can visualize the “68-96-100” rule with a density curve. Notice the area of each region is shown, and you can estimate the areas by counting percentage boxes.
We can display our 68-96-100 rule with a spinner.
And, we can display our 68-96-100 rule with a cumulative curve. In this case, we will introduce the notation of using z as the multiplier of σ. For example, if z = 2, then the measurement is μ + 2σ. We call z the standard score.
Population (with infinitely many individuals) has measurements that are normally distributed with the given mean and standard deviation. Use the empirical rule (68-96-100 rule) to answer the following questions.
It helps to draw a diagram using the supplied mean (μ) and standard deviation (σ).
A standard score (z) can be calculated from a measurement (x), the population mean (μ), and the population standard deviation (σ): z = (x − μ)/σ. With a little algebra, you can create another formula solved for the measurement: x = μ + zσ.
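A quick numeric illustration of converting in both directions (the values of mu, sigma, and x here are made-up for illustration):

```r
# Converting between a measurement and its standard score.
mu = 100      # illustrative population mean
sigma = 15    # illustrative population standard deviation
x = 121       # illustrative measurement
z = (x - mu) / sigma   # standard score: 1.4
mu + z * sigma         # formula solved for the measurement: recovers x
```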
A gambler is interested in the sum of 160 rolls of 4-sided dice. The process of summing 160 rolls of 4-sided dice can be repeated infinitely many times, giving independent results each time, so those sums can be thought of as an infinitely large population. This population happens to be approximately normal (see central limit theorem).
The gambler knows how to calculate the population mean (see discrete-uniform distribution). She also knows how to calculate the population standard deviation.
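As a sketch of those calculations in R: a single fair 4-sided roll has mean 2.5 and variance (4² − 1)/12 = 1.25, and independent sums add means and variances, so the sum of 160 rolls has mean 400 and standard deviation √200 ≈ 14.14. A simulation should land near those values:

```r
# Parameters of the sum of 160 independent rolls of a fair 4-sided die.
faces = 1:4
mu_roll = mean(faces)                  # 2.5
var_roll = mean((faces - mu_roll)^2)   # (4^2 - 1)/12 = 1.25
n = 160
mu_sum = n * mu_roll                   # expected value of the sum: 400
sd_sum = sqrt(n * var_roll)            # sd of the sum: sqrt(200), about 14.14
# A quick simulation should land near those values:
sums = replicate(10000, sum(sample(faces, n, replace = TRUE)))
c(mean(sums), sd(sums))
```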
A sample’s statistic typically approaches its corresponding population parameter as the sample size grows, but with a small sample size there is usually noticeable error. Let’s explore this idea with an example.
A geometric distribution is a well-studied discrete population. The spinner below represents a geometric distribution with the following population parameters.
That spinner was spun many times. The raw data is displayed below and can be downloaded as a CSV.
11, 1, 1, 17, 3, 1, 18, 2, 3, 1, 0, 6, 2, 12, 16, 4, 5, 17, 9, 0, 11, 0, 0, 2, 12, 6, 1, 0, 9, 0, 4, 10, 0, 4, 16, 1, 2, 2, 2, 10, 0, 3, 8, 4, 8, 4, 9, 0, 53, 27, 0, 5, 0
Please calculate the following sample statistics.
Notice the sample mean and sample standard deviation do not match the population mean and population standard deviation exactly.
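In R, you could compute those sample statistics directly from the values printed above (the vector below is copied from the listing):

```r
# Sample statistics for the geometric data listed above.
x = c(11, 1, 1, 17, 3, 1, 18, 2, 3, 1, 0, 6, 2, 12, 16, 4, 5, 17, 9, 0,
      11, 0, 0, 2, 12, 6, 1, 0, 9, 0, 4, 10, 0, 4, 16, 1, 2, 2, 2, 10,
      0, 3, 8, 4, 8, 4, 9, 0, 53, 27, 0, 5, 0)
mean(x)   # sample mean
sd(x)     # sample standard deviation (with Bessel's correction)
```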
A normal distribution is a well-studied continuous population. The spinner below represents a normal distribution with the following population parameters.
That spinner was spun many times. The raw data is displayed below and can be downloaded as a CSV.
46.1334, 46.5424, 47.8209, 35.8491, 41.2869, 51.1657, 46.4394, 44.1339, 43.8125, 44.9495, 44.8426, 60.0421, 51.0485, 55.4653, 55.2529, 37.767, 54.026, 63.5898, 34.3265, 47.3499, 52.8916, 46.2517, 45.3903, 64.2519, 40.937, 60.1959, 53.3353, 53.0661, 49.5026, 58.5484, 55.6279, 52.2783, 36.2523, 55.7074, 34.5496, 37.9945, 62.39, 67.77, 37.5741, 39.4837, 43.6251, 33.3043, 51.8723, 49.5009, 47.3149, 60.925, 59.945, 51.1887
Please calculate the following sample proportions. (All answers can be rounded to nearest hundredth.)
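The R idiom for a sample proportion is the mean of a logical (TRUE/FALSE) vector. The cutoffs and the tiny vector below are just for illustration; apply the same idiom to the full data set:

```r
# A sample proportion is the mean of a logical vector.
x = c(46.13, 51.17, 44.13, 60.04, 37.77, 63.59)  # illustrative values
mean(x < 50)            # proportion of values below 50: 3/6 = 0.5
mean(x > 40 & x < 60)   # proportion strictly between 40 and 60: 3/6 = 0.5
```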
A standard eight-sided die was rolled many times, and the results were organized into the histogram shown below.
Three large samples were taken from three different populations. Their distributions are shown as histograms.
All three distributions look like they have similar bell shape, but their centers and spreads are all different.
Three large samples were taken from three different populations. Their distributions are shown as histograms.
All three distributions look like they have similar ranges (widths), but different shapes. So, we will use the fact that for a given range, bell shape has the smallest standard deviation and bimodal has the largest standard deviation. This is because a bell shape has many measurements near the middle, whereas the bimodal shape has many measurements near the edges of its interval.
Five different populations were sampled, and the measurements were visualized as five boxplots. (Note: typical boxplots indicate outliers with dots. For simplicity, these boxplots include all outliers in the whiskers.)
When measuring individuals from a population, we expect most measurements to be within the interval of typical measurements. We will define the interval of typical measurements (using interval notation). In other words, we expect most measurements to be between two bounds.
A population of lizards has a mean length of cm and a standard deviation of cm. Determine the interval of typical measurements.
You need to use the formulas. Remember your order of operations!
We can visualize the “68-96-100” rule with a density curve. Notice the area of each region is shown, and you can estimate the areas by counting percentage boxes.
Population has measurements that are normally distributed with the given mean and standard deviation. Use the empirical rule (68-96-100 rule) to answer the following questions.
It helps to draw a diagram using the supplied mean (μ) and standard deviation (σ).
A standard score (z) can be calculated from a measurement (x), the population mean (μ), and the population standard deviation (σ): z = (x − μ)/σ.
The following (normal) spinner has population mean and population standard deviation .
A sample of size was taken from an unknown population.
3.13, 8.95, 8.53, 2.79, 7.41, 8.11, 9.97, 5.88, 8.32, 8.19,
1.80, 9.17, 8.28, 9.98, 5.46, 7.92, 9.84, 9.76, 9.83, 7.19,
9.42, 5.17, 9.64, 8.93, 9.33, 8.15, 5.50, 9.77, 4.66, 9.29,
6.11, 7.85, 4.25, 1.06, 9.99, 8.34, 7.05, 3.47, 9.18, 8.54,
6.33, 8.51, 8.59, 6.33, 6.69, 9.33, 8.95, 9.62, 6.29, 1.10,
4.18, 7.61, 7.76, 9.96, 7.77, 5.42, 9.01, 8.22, 9.09, 6.29,
6.85, 7.57, 9.77, 2.63, 5.30, 9.24, 6.38, 5.54, 8.23, 3.57,
8.88, 6.72, 7.92, 9.03, 9.54, 6.28, 0.68, 7.64, 9.43, 2.78,
9.67, 9.90, 5.71, 2.13, 9.20, 8.59, 9.09, 8.91, 6.27, 9.68,
7.65, 7.45, 9.88,10.00, 2.87, 8.64, 5.20, 9.97, 9.45, 5.26,
4.88, 8.99, 8.51, 8.16, 6.26, 9.92, 6.13, 6.89, 4.66, 9.73,
8.64, 3.05, 7.90, 5.81, 6.18, 7.78, 5.11, 3.55, 3.47, 8.65,
7.12, 4.50,10.00, 5.14, 4.71, 3.44, 8.41, 8.90, 8.41, 4.66,
9.46, 8.49, 9.89, 9.49, 9.85, 8.81, 9.88, 5.90, 9.89, 8.98,
8.86, 2.96, 2.73, 8.74, 7.19, 9.91, 9.16, 5.76, 5.58, 5.05,
5.55, 9.85, 7.14, 7.74, 9.22, 2.38, 9.38, 3.13, 2.59, 7.61,
5.49, 9.61, 9.43, 7.09, 4.31, 9.02, 4.04, 9.21, 8.86, 4.00,
7.73, 2.48, 8.30, 8.46, 8.77, 9.91, 4.55, 8.11, 4.93, 5.33,
6.33, 7.84, 7.44, 8.33, 7.86, 2.12, 4.73, 1.39, 8.17, 9.50,
8.79, 9.55, 6.41, 3.58, 8.60, 7.98, 6.95, 7.91,10.00, 9.30,
8.50, 3.22, 2.43, 8.57, 9.23, 4.02, 7.51, 8.59, 9.97, 7.40,
7.47, 9.92, 7.83, 9.90, 9.69, 9.99, 8.18, 2.74, 7.30, 9.81,
5.64, 3.90, 9.62, 4.12, 9.35, 3.17, 2.37, 2.32, 9.61, 9.99,
7.98, 9.80, 5.58, 2.69, 4.08, 5.90, 5.93, 9.64, 1.70, 6.93,
8.03, 9.63, 9.19, 1.34, 9.13, 3.55, 4.62, 1.98, 2.45, 8.35,
7.36, 8.12, 9.98, 7.03, 8.07, 8.31, 1.65, 7.44, 7.47, 9.16,
1.08, 4.48, 6.76, 5.39, 9.84, 6.44, 6.91, 5.93, 6.66, 6.13,
8.63, 9.97, 6.06, 8.91, 5.15, 3.50, 9.93, 4.64, 9.97, 9.71,
5.61, 2.08, 6.91, 9.82, 5.93, 4.90, 8.17, 7.26, 9.38,10.00,
9.98, 7.18, 9.65, 4.25, 9.68,10.00, 9.34, 8.23, 9.93, 9.88,
9.71, 2.52, 9.30, 5.83, 4.73, 7.59, 3.31, 2.88, 3.29, 9.36,
9.25, 5.06, 4.99, 9.51, 9.42, 2.12, 9.71, 8.97, 9.62, 9.84,
4.52, 6.58, 6.99, 1.28, 0.63, 9.78, 8.10, 8.96, 6.83, 9.52,
3.23, 8.92, 9.95, 9.93, 2.78, 8.98, 9.37, 8.21, 8.64, 8.33,
2.28, 9.55, 9.55, 7.29, 7.72, 8.02, 5.41, 8.06, 0.93, 9.07,
6.25, 3.53, 8.79, 5.95, 5.55, 8.93, 9.52, 7.61, 9.82, 3.24,
2.70, 5.53, 8.79, 7.90, 5.51, 8.72, 5.96, 7.66, 8.68, 9.99,
9.82, 4.69, 9.36, 2.51, 6.62, 8.78, 3.27, 6.43, 9.19, 7.31,
7.72, 8.66, 9.99, 9.46, 5.48, 9.39, 9.95, 4.96, 8.71, 8.91,
1.47, 5.96, 2.92, 9.13, 8.89, 9.86, 8.80, 8.85, 6.93, 9.94
You can download the data as a CSV. Determine which histogram visualizes the data, and describe the shape of the data.
You should make a histogram. This is easy in R.
x = read.csv("make_hist.csv")$x
hist(x)
Using a spreadsheet is way more work. But you could just make a frequency distribution and decide from there.
In a deck of strange cards, there are 426 cards. Each card has an image and a color. The amounts are shown in the table below and can be downloaded as a csv.
(Answers can be rounded to nearest hundredth.)
The key logical terms are “and”, “or”, and “given”. Notice that I am using “given” as a shorter version of “under the condition”.
A spinner was constructed:
The spinner’s probability distribution is shown below.
| x | P(x) |
|---|---|
| 10 | 0.21 |
| 12 | 0.09 |
| 13 | 0.49 |
| 15 | 0.15 |
| 20 | 0.06 |
It can also be downloaded as a csv.
Make a table (for parts mean and standard deviation).
| x | P(x) | x·P(x) | x − μ | (x − μ)² | (x − μ)²·P(x) |
|---|---|---|---|---|---|
| 10 | 0.21 | 2.1 | -3 | 9 | 1.89 |
| 12 | 0.09 | 1.08 | -1 | 1 | 0.09 |
| 13 | 0.49 | 6.37 | 0 | 0 | 0 |
| 15 | 0.15 | 2.25 | 2 | 4 | 0.6 |
| 20 | 0.06 | 1.2 | 7 | 49 | 2.94 |
| Total | 1 | 13 | | | 5.52 |
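The table’s arithmetic can be reproduced in R:

```r
# Mean, variance, and standard deviation of the spinner's distribution.
x = c(10, 12, 13, 15, 20)
p = c(0.21, 0.09, 0.49, 0.15, 0.06)
mu = sum(x * p)                # expected value: 13
sigma2 = sum((x - mu)^2 * p)   # variance: 5.52
sigma = sqrt(sigma2)           # standard deviation, about 2.35
c(mu, sigma2, sigma)
```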
A pizza shop has 14 different toppings available. You will choose 4 different toppings for your pizza. How many possibilities exist?
This scenario describes a combinations problem (order of selection does not matter). We are considering the subsets of size 4 from a set of size 14.
Remember, we care about combinations because they represent all the ways we can select 1s and 0s. So, in this case, C(14, 4) = 1001 tells us there are 1001 ways of selecting 10 0s and 4 1s. (Think of the 1s as the toppings that are selected and the 0s as the toppings NOT selected.)
So, if you had a lot of time, you could list out all possibilities:
| Count | Possibility |
|---|---|
| 1 | 0 1 1 0 0 0 0 1 0 0 0 1 0 0 |
| 2 | 0 1 0 0 0 1 1 0 0 0 1 0 0 0 |
| 3 | 0 0 0 0 0 0 1 0 0 1 0 1 1 0 |
| 4 | 0 1 0 1 0 0 0 0 0 1 0 0 0 1 |
| 5 | 0 0 1 0 0 0 0 0 0 1 0 1 0 1 |
| 997 | 0 0 0 0 0 1 1 0 1 0 0 0 1 0 |
| 998 | 0 0 0 0 1 0 0 1 0 1 0 0 0 1 |
| 999 | 1 0 0 1 0 0 1 1 0 0 0 0 0 0 |
| 1000 | 0 1 0 0 1 1 0 0 1 0 0 0 0 0 |
| 1001 | 0 0 0 0 0 0 0 1 1 1 1 0 0 0 |
Of course, you’d want to be more systematic than that…
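In R, the combinations count comes from choose() (or, equivalently, the factorial formula):

```r
choose(14, 4)                                   # 1001 possible topping sets
factorial(14) / (factorial(4) * factorial(10))  # same count from n!/(k!(n-k)!)
```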
Joe is shopping for shirts. Joe likes 17 of the shirts, but will only buy 4 of them. How many possibilities exist?
This scenario describes a combinations problem (order does not matter). We are considering the subsets of size 4 from a set of size 17.
A company needs to select a CFO, a president, and a secretary. Each position will be held by a different person. The company is considering the same pool of 24 applicants for each position. How many possibilities exist?
This scenario describes a permutations problem (order matters). We are considering the nonrepeating sequences of size 3 from a set of size 24.
If you had a lot of time, you could list out all possibilities (using 1 for a CFO, 2 for a president…):
| Count | Possibility |
|---|---|
| 1 | 0 0 0 0 1 0 0 0 0 3 0 0 0 0 0 0 2 0 0 0 0 0 0 0 |
| 2 | 0 3 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
| 3 | 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 |
| 4 | 0 3 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 |
| 5 | 0 0 0 0 0 0 0 0 0 0 0 3 0 0 1 0 0 0 0 2 0 0 0 0 |
| 12140 | 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 3 0 0 0 0 0 |
| 12141 | 1 2 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 |
| 12142 | 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 2 0 0 0 1 0 0 |
| 12143 | 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 1 0 0 |
| 12144 | 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 2 |
Of course, you’d want to be more systematic than that.
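In R, you can count the permutations directly:

```r
# 24 choices for CFO, then 23 for president, then 22 for secretary.
prod(24:22)                         # 12144
factorial(24) / factorial(24 - 3)   # same formula; huge factorials may round slightly
```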
A team has 20 players. The coach will give out 5 different prizes to different players. How many ways could the coach do this?
This scenario describes a permutations problem (order matters). We are considering the nonrepeating sequences of size 5 from a set of size 20.
In some situation, each trial has 0.61 probability of success. There will be 10 trials. (Thus the number of successes will follow a binomial distribution.)
This is a binomial distribution, so use the appropriate formula: P(X = x) = C(n, x) · p^x · (1 − p)^(n − x), where p is the probability of success on each trial, x is a specific number of successes, n is the number of trials, and C is the combinations operator (so that C(n, x) = n! / (x!(n − x)!)). Some people prefer to also use q as the probability of failure, such that q = 1 − p.
You will also need to add mutually exclusive probabilities (when multiple -values satisfy the probability’s condition). It is also helpful to be aware of the complement rule.
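In R, dbinom() and pbinom() implement these formulas. (The value 6 below is just an example number of successes, since the specific questions vary.)

```r
n = 10; p = 0.61
dbinom(6, size = n, prob = p)          # P(exactly 6 successes)
choose(n, 6) * p^6 * (1 - p)^(n - 6)   # same value, from the formula
pbinom(6, size = n, prob = p)          # P(6 or fewer successes)
1 - pbinom(6, size = n, prob = p)      # complement rule: P(7 or more successes)
```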
Bob has a 0.69 chance of winning a game. If Bob wins, he has a 0.48 chance of being happy after the game. If Bob loses, he has a 0.21 chance of being happy after the game.
After the game, you notice Bob is happy. What is the probability that Bob won his game? (Do not answer as a percentage; answer as a decimal.)
Use the definition of conditional probability. We can first determine all the joint probabilities.
Notice there are two disjoint ways Bob could be happy. So, back to the conditional probability.
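The arithmetic can be laid out in R:

```r
# Joint probabilities, then the conditional probability P(won | happy).
p_win = 0.69
p_happy_if_win = 0.48
p_happy_if_loss = 0.21
p_win_and_happy = p_win * p_happy_if_win           # 0.69 * 0.48 = 0.3312
p_lose_and_happy = (1 - p_win) * p_happy_if_loss   # 0.31 * 0.21 = 0.0651
p_happy = p_win_and_happy + p_lose_and_happy       # 0.3963
p_win_and_happy / p_happy                          # about 0.836
```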
Cindy has two games today. Each game she will either win or lose. She has a 0.34 chance of winning the first game and a 0.79 chance of winning the second game.
Determine the probability that the standard normal variable is less than -0.87. In other words, evaluate P(Z < -0.87).
The numbers that satisfy z < -0.87 are on the left side of a number line (toward -∞). The probability equals a left area under the density curve.
By using the z-table, we find the appropriate probability: P(Z < -0.87) = 0.1922.
It might help to visualize with a spinner:
Using a spreadsheet:
=NORM.DIST(-0.87,0,1,TRUE)
Using R:
pnorm(-0.87)
Determine the probability that the standard normal variable is more than 0.05. In other words, evaluate P(Z > 0.05).
First, you need to identify that we are looking for a right area. This is because large values of z satisfy z > 0.05.
By using the z-table, we find the left area (even though we eventually want the right area): P(Z < 0.05) = 0.5199.
We use the rule of complements to determine the right area: P(Z > 0.05) = 1 - 0.5199 = 0.4801.
Method 2: You might also recognize that the normal distribution is symmetric. Thus, P(Z > 0.05) = P(Z < -0.05) = 0.4801.
Method 3: It often helps to draw a picture.
It might help to visualize with a spinner:
Using a spreadsheet:
=1-NORM.DIST(0.05,0,1,TRUE)
Using R:
1-pnorm(0.05)
Determine the probability that the absolute standard normal variable is less than 0.89. In other words, evaluate P(|Z| < 0.89).
First, you need to identify that we are looking for a central area. This is because z-scores near 0 satisfy |z| < 0.89, while z-scores far from 0 (either positive or negative) do not satisfy the inequality.
Start with a sketch.
Remember, the entire area is always 1. Most z-tables only provide left areas, so three methods are shown to get the central area from a left-area (cumulative probability) table.
Method 1: First find the area of the left tail. Recognize normal distributions are symmetric, so we also know the area of the right tail. The three areas add to 1.
This technique can be summarized with the following formula: P(|Z| < z*) = 1 - 2·P(Z < -z*), assuming z* > 0.
So, in our case (when z* = 0.89), P(|Z| < 0.89) = 1 - 2·P(Z < -0.89) = 1 - 2(0.1867) = 0.6266.
This method is shown graphically:
Method 2: You can also find half of the central area and then double it.
This method is shown graphically:
Method 3: You can also calculate the central area with a difference of two left areas: P(|Z| < 0.89) = P(Z < 0.89) - P(Z < -0.89) = 0.8133 - 0.1867 = 0.6266.
It might be helpful to visualize with a spinner.
In a spreadsheet, you could use the NORM.DIST() function.
=NORM.DIST(0.89,0,1,TRUE) - NORM.DIST(-0.89,0,1,TRUE)
In R, you could use the pnorm function.
pnorm(0.89) - pnorm(-0.89)
Determine the probability that the absolute standard normal variable is more than 0.56. In other words, evaluate P(|Z| > 0.56).
First, you need to identify that we are looking for a two-tail area (the sum of the left and right tails). This is because z-scores far from 0 satisfy |z| > 0.56, while z-scores near 0 do not satisfy the inequality.
Start with a sketch.
Remember, the entire area is always 1. Most z-tables only provide left areas, so the following methods are shown to get the two-tail area from a left-area (cumulative probability) table.
Method 1: First find the area of the left tail. Recognize normal distributions are symmetric, so we also know the area of the right tail. The two areas add to our desired two-tail area.
This technique can be summarized with the following formula: P(|Z| > z*) = 2·P(Z < -z*), assuming z* > 0.
So, in our case (when z* = 0.56), P(|Z| > 0.56) = 2·P(Z < -0.56) = 2(0.2877) = 0.5754.
This method is shown graphically:
Notice we need to add both tails.
Method 2: You can achieve the same result by using the following formula: P(|Z| > z*) = 2(1 - P(Z < z*)). So, P(|Z| > 0.56) = 2(1 - 0.7123) = 0.5754.
It might be helpful to visualize with a spinner.
In a spreadsheet, you could use the NORM.DIST() function.
=2*NORM.DIST(-0.56,0,1,TRUE)
In R, you could use the pnorm function.
2*pnorm(-0.56)
Determine the probability that the standard normal variable is between -1.35 and -0.16. In other words, evaluate P(-1.35 < Z < -0.16).
Start with a sketch.
We take a difference of areas: P(-1.35 < Z < -0.16) = P(Z < -0.16) - P(Z < -1.35) = 0.4364 - 0.0885 = 0.3479.
In a spreadsheet, you could use the NORM.DIST() function.
=NORM.DIST(-0.16,0,1,TRUE) - NORM.DIST(-1.35,0,1,TRUE)
In R, you could use the pnorm function.
pnorm(-0.16) - pnorm(-1.35)
Determine z* such that P(Z < z*) = 0.41. In other words, what z-score is greater than 41% of standard normal values? (Answers within 0.01 from the correct value will be marked correct.)
Start with a sketch. Leftward numbers (toward -∞) will be less than our boundary z*, so we shade a left region with area 0.41.
You should go to your z-table and find the z-score with the left area closest to 0.41.
| z | left area |
|---|---|
| -0.25 | 0.4013 |
| -0.24 | 0.4052 |
| -0.23 | 0.409 |
| -0.22 | 0.4129 |
| -0.21 | 0.4168 |
| -0.2 | 0.4207 |
It turns out the exact answer is about -0.227545, which could be found by using an inverse normal function. On a spreadsheet:
=Norm.Inv(0.41,0,1)
Using R:
qnorm(0.41)
But, the z-table is accurate enough, so I will accept either -0.23 or -0.22 (anything within 0.01 of -0.227545).
You might find it helpful to visualize with a spinner.
Determine z* such that P(Z > z*) = 0.94. In other words, what z-score is less than 94% of standard normal values? (Answers within 0.01 from the correct value will be marked correct.)
Start with a sketch. Rightward numbers (toward +∞) will be more than our boundary z*, so we shade a rightward region with area 0.94.
You should first find the left area: 1 - 0.94 = 0.06.
You should go to your z-table and find the z-score with the left area closest to 0.06.
| z | left area |
|---|---|
| -1.58 | 0.0571 |
| -1.57 | 0.0582 |
| -1.56 | 0.0594 |
| -1.55 | 0.0606 |
| -1.54 | 0.0618 |
| -1.53 | 0.063 |
It turns out the exact answer is about -1.5547736, which could be found by using an inverse normal function. On a spreadsheet:
=Norm.Inv(0.06,0,1)
Using R:
rightarea = 0.94
leftarea = 1-rightarea
qnorm( leftarea )
But, the z-table is accurate enough, so I will accept either -1.56 or -1.55 (anything within 0.01 of -1.5547736).
You might find it helpful to visualize with a spinner.
Determine z* such that P(|Z| < z*) = 0.54. In other words, how far from 0 should boundaries be set such that 54% of standard normal values are between those boundaries? (Answers within 0.01 from the correct value will be marked correct.)
Start with a sketch.
Method 1: Determine the area of each tail. Both tails have the same area, and all three areas add to 1. Thus, each tail has area (1 - 0.54)/2 = 0.23, and the left area below the upper boundary is 0.23 + 0.54 = 0.77.
You should go to your z-table and find the z-score with the left area closest to 0.77.
| z | left area |
|---|---|
| 0.71 | 0.7611 |
| 0.72 | 0.7642 |
| 0.73 | 0.7673 |
| 0.74 | 0.7704 |
| 0.75 | 0.7734 |
| 0.76 | 0.7764 |
It turns out the exact answer is about 0.7388468, which could be found by using an inverse normal function. On a spreadsheet:
=Norm.Inv(0.77,0,1)
Using R:
centralarea = 0.54
leftarea = (1-centralarea)/2 + centralarea
qnorm( leftarea )
But, because we are using the z-table, I will accept either 0.73 or 0.74. (Or, really anything within 0.01 of 0.7388468.)
Method 2: Another way to get 0.77 is by adding half of 0.54 to 0.5: 0.5 + 0.27 = 0.77.
Then, use the table. Or, you could use R:
centralarea = 0.54
leftarea = 0.5 + centralarea/2
qnorm( leftarea )
It might be helpful to visualize with a spinner.
Determine z* such that P(|Z| > z*) = 0.20. In other words, how far from 0 should boundaries be set such that 20% of standard normal values are outside those boundaries? (Answers within 0.01 from the correct value will be marked correct.)
Start with a sketch. The total two-tail area is 0.2, so each tail has half that area.
Method 1: Determine the area of each tail and the center. Both tails have the same area, and all three areas add to 1. Thus, each tail has area 0.2/2 = 0.1, and the central area is 1 - 0.2 = 0.8.
Find the left area: 0.1 + 0.8 = 0.9.
You should go to your z-table and find the z-score with the left area closest to 0.9.
| z | left area |
|---|---|
| 1.26 | 0.8962 |
| 1.27 | 0.898 |
| 1.28 | 0.8997 |
| 1.29 | 0.9015 |
| 1.3 | 0.9032 |
| 1.31 | 0.9049 |
It turns out the exact answer is about 1.2815516, which could be found by using an inverse normal function. On a spreadsheet:
=Norm.Inv(0.9,0,1)
Using R:
twotailarea = 0.2
onetailarea = twotailarea/2
centralarea = 1-twotailarea
leftarea = onetailarea + centralarea
qnorm( leftarea )
Method 2: Another way to get 0.9 is by subtracting half of 0.2 from 1: 1 - 0.1 = 0.9.
Then, use the table. Or, R:
twotailarea = 0.2
leftarea = 1 - twotailarea/2
qnorm( leftarea )
You might find a spinner visualization useful.
A farm produces 4 types of fruit: kiwis, plums, apricots, and apples. The fruits’ masses follow normal distributions, with population parameters dependent on the type of fruit.
| Type of fruit | Mean mass (g) | Standard deviation of mass (g) |
|---|---|---|
| kiwis | 95 | 8 |
| plums | 105 | 8 |
| apricots | 43 | 4 |
| apples | 214 | 12 |
One specimen of each type is weighed. The results are shown below.
| Specimen type | Mass of specimen (g) |
|---|---|
| kiwi | 103.2 |
| plum | 100.8 |
| apricot | 42 |
| apple | 203.9 |
The population parameters and specimen masses can be downloaded as a csv.
For each measurement, determine the standard score and the cumulative probability. Then determine which specimen is most unusually large, most unusually small, most typically sized, and most unusually sized.
The formula to determine the z-score of a measurement is the ratio with numerator the difference between measurement and population mean and denominator the population standard deviation.
The highest z-score (furthest right on the number line) corresponds to the most unusually large measurement.
The smallest z-score (furthest left on the number line) corresponds to the most unusually small measurement.
The smallest absolute z-score corresponds to the most typically sized measurement.
The largest absolute z-score corresponds to the most unusually sized measurement.
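You could check all four specimens at once in R (the variable names below are my own):

```r
# z-score and cumulative probability for each specimen
fruit <- c("kiwi", "plum", "apricot", "apple")
x     <- c(103.2, 100.8, 42.0, 203.9)  # specimen masses (g)
mu    <- c(95, 105, 43, 214)           # population means (g)
sigma <- c(8, 8, 4, 12)                # population standard deviations (g)

z <- (x - mu) / sigma                  # standard scores
cum_prob <- pnorm(z)                   # cumulative probabilities
data.frame(fruit, z = round(z, 2), cum_prob = round(cum_prob, 4))

fruit[which.max(z)]       # most unusually large
fruit[which.min(z)]       # most unusually small
fruit[which.min(abs(z))]  # most typically sized
fruit[which.max(abs(z))]  # most unusually sized
```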
Random variable is normally distributed with mean and standard deviation . Evaluate .
First, draw a sketch. We can label the axis by adding integer multiples of 10 to 40. We know to shade toward the left because small values of satisfy the condition .
We are given a specific value as a boundary. (Remember, for random variables we use uppercase letters, but for specific values we use lowercase.) We calculate the value of the boundary.
We have rephrased our problem into a standard normal probability problem, because
So, we just need to evaluate . To do this, you just need a -table.
| -0.32 | 0.3745 |
| -0.31 | 0.3783 |
| -0.3 | 0.3821 |
| -0.29 | 0.3859 |
| -0.28 | 0.3897 |
Thus, we find our answer.
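You could also check with R. Here I assume the boundary's z-score is -0.30 (the middle row of the table), so the implied boundary is 40 - 0.30(10) = 37 given the sketch's mean 40 and standard deviation 10:

```r
pnorm(-0.30)                   # left-tail area for z = -0.30; about 0.3821
pnorm(37, mean = 40, sd = 10)  # same area, without standardizing by hand
```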
Random variable is normally distributed with mean and standard deviation . Evaluate .
First, draw a sketch. We can label the axis by adding integer multiples of 21 to 86. We know to shade toward the right because large values of satisfy the condition .
We are given a specific value as a boundary. (Remember, for random variables we use uppercase letters, but for specific values we use lowercase.) We calculate the value of the boundary.
We have rephrased our problem into a standard normal probability problem, because
So, we just need to evaluate .
To do this, you need to remember that right-area events are complementary to left-area events.
You can use a -table.
| 0.98 | 0.8365 |
| 0.99 | 0.8389 |
| 1 | 0.8413 |
| 1.01 | 0.8438 |
| 1.02 | 0.8461 |
Thus, we find our answer.
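In R, assuming the boundary's z-score is 1.00 (the middle row of the table), the right area is the complement of the tabled left area:

```r
1 - pnorm(1)                  # right-tail area; about 0.1587
pnorm(1, lower.tail = FALSE)  # equivalent, without the subtraction
```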
Random variable is normally distributed with mean and standard deviation . Evaluate . In other words, what is the probability that is within units from the mean?
First, draw a sketch. We can label the axis by adding integer multiples of 4 to 75. We know to shade the center because values near 75 satisfy the condition . We draw the boundaries at and because those are the solutions to . We can also rephrase the probability.
We calculate the values of the boundaries. Left boundary:
Right boundary:
We have rephrased our problem into a standard normal probability problem, because
So, we just need to evaluate . I will also point out that . In general, if is normally distributed, then:
From here we have a formula that lets us use the table. (We practiced this part before.)
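The symmetry formula for a central area can be checked in R with any boundary; z = 1.5 below is only an illustration, not the value from this problem:

```r
z <- 1.5                 # hypothetical boundary z-score, for illustration
pnorm(z) - pnorm(-z)     # central area between -z and z
2 * pnorm(z) - 1         # the same area via symmetry
```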
Random variable is normally distributed with mean and standard deviation . Evaluate . In other words, what is the probability that is outside units from the mean?
First, draw a sketch. We can label the axis by adding integer multiples of 0.05 to 0.16. We know to shade the two tails because values far from 0.16 satisfy the condition . We draw the boundaries at and because those are the solutions to . We can also rephrase the probability.
We calculate the values of the boundaries. Left boundary:
Right boundary:
We have rephrased our problem into a standard normal probability problem, because
So, we just need to evaluate . I will also point out that . In general, if is normally distributed, then:
From here we have a formula that lets us use the table. (We practiced this part before.)
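The two-tail symmetry formula can likewise be checked in R; z = 1.5 below is only an illustration, not the value from this problem:

```r
z <- 1.5                     # hypothetical boundary z-score, for illustration
pnorm(-z) + (1 - pnorm(z))   # left tail plus right tail
2 * pnorm(-z)                # the same two-tail area via symmetry
```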
Random variable is normally distributed with mean and standard deviation . Evaluate . In other words, what is the probability that is between and ?
First, draw a sketch. We can label the axis by adding integer multiples of 40 to 180. We know to shade between the two boundaries because those values satisfy the condition.
We calculate the values of the boundaries. Left boundary:
Right boundary:
We rephrase our problem into a standard normal probability problem:
So, we just need to evaluate .
From here we have a formula that lets us use the table. (We practiced this part before.)
Random variable is normally distributed with mean and standard deviation . Evaluate such that . In other words, determine an upper boundary such that a normal spinner with mean 9.6 and standard deviation 2.6 lands under that boundary 39% of the time.
First, draw a sketch. We can label the axis by adding integer multiples of 2.6 to 9.6. We know to shade the left because low values of satisfy the condition (regardless of the exact value of ). We don’t know exactly where to place the boundary, but we know the left area is 0.39.
It is helpful to know the following approximations:
| -3 | 0.001 |
| -2 | 0.023 |
| -1 | 0.159 |
| 0 | 0.5 |
| 1 | 0.841 |
| 2 | 0.977 |
| 3 | 0.999 |
So, we know the z-score is between -1 and 0. Remember, Z and z always refer to the standard normal variable.
By using the table we can determine more precisely.
| -0.3 | 0.3821 |
| -0.29 | 0.3859 |
| -0.28 | 0.3897 |
| -0.27 | 0.3936 |
| -0.26 | 0.3974 |
| -0.25 | 0.4013 |
Either -0.28 or -0.27 is a good estimation of , and either value will lead you to an acceptable answer. Using other tools, a more accurate value can be found. I will show the work with a more accurate value.
We now convert the score into a score.
We can also visualize this with a spinner.
The tolerance for an acceptable answer was 0.1 from 8.8737705. So, anything between 8.7737705 and 8.9737705 was accepted.
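You can check the whole computation in R: qnorm accepts the mean and standard deviation directly, so no manual z-to-x conversion is needed.

```r
# boundary with left area 0.39, for a normal with mean 9.6 and sd 2.6
qnorm(0.39, mean = 9.6, sd = 2.6)   # about 8.8738
```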
Random variable is normally distributed with mean and standard deviation . Evaluate such that . In other words, determine a lower boundary such that a normal spinner with mean 35 and standard deviation 7 lands above that boundary 94% of the time.
First, draw a sketch. We can label the axis by adding integer multiples of 7 to 35. We know to shade the right because high values of satisfy the condition (regardless of the exact value of ). We don’t know exactly where to place the boundary, but we know the right area is 0.94.
We know how to find the left area.
As an intermediate step, we find such that .
By using the table we can determine .
| -1.58 | 0.0571 |
| -1.57 | 0.0582 |
| -1.56 | 0.0594 |
| -1.55 | 0.0606 |
| -1.54 | 0.0618 |
| -1.53 | 0.0630 |
Either -1.56 or -1.55 is a good estimation of , and either value will lead you to an acceptable answer. Using other tools, a more accurate value can be found. I will show the work with a more accurate value.
We now convert the score into a score.
We can also visualize this with a spinner.
The tolerance for an acceptable answer was 1 from 24.1165848. So, anything between 23.1165848 and 25.1165848 was accepted.
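Again, R's qnorm can do the whole conversion at once; the lower.tail argument lets you work from the right area directly.

```r
qnorm(0.06, mean = 35, sd = 7)                      # left area 0.06; about 24.1166
qnorm(0.94, mean = 35, sd = 7, lower.tail = FALSE)  # same boundary from the right area 0.94
```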
Random variable is normally distributed with mean and standard deviation .
First, draw a sketch. We can label the axis by adding integer multiples of 1.7 to 8.3. We know to shade the middle because values near 8.3 satisfy the condition (regardless of the exact value of ). We don’t know exactly where to place the boundaries, but we know the central area is 0.2.
As an intermediate step, let’s find such that . First, we need to evaluate .
You could have also drawn some pictures… we know there is symmetry and all the areas should add to 1.
We find such that .
By using the table we can determine .
| 0.23 | 0.591 |
| 0.24 | 0.5948 |
| 0.25 | 0.5987 |
| 0.26 | 0.6026 |
| 0.27 | 0.6064 |
| 0.28 | 0.6103 |
Either 0.25 or 0.26 is a good estimation of , and either value will lead you to an acceptable answer. Using other tools, a more accurate value can be found. I will show the work with a more accurate value.
We now convert the score into a score.
We can also visualize this with a spinner.
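In R, the two boundaries of the central region come straight from the outside-left areas 0.4 and 0.6:

```r
lower <- qnorm(0.4, mean = 8.3, sd = 1.7)  # about 7.87
upper <- qnorm(0.6, mean = 8.3, sd = 1.7)  # about 8.73
# check: the area between the boundaries recovers the central area 0.2
pnorm(upper, 8.3, 1.7) - pnorm(lower, 8.3, 1.7)
```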
Random variable is normally distributed with mean and standard deviation .
First, draw a sketch. We can label the axis by adding integer multiples of 130 to 600. We know to shade the two tails because values far from 600 satisfy the condition (regardless of the exact value of ). We don’t know exactly where to place the boundaries, but we know the two-tail area is 0.7.
As an intermediate step, let’s find such that . First, we need to evaluate .
You could have also drawn some pictures… we know there is symmetry and all the areas should add to 1.
We find such that .
By using the table we can determine .
| 0.36 | 0.6406 |
| 0.37 | 0.6443 |
| 0.38 | 0.648 |
| 0.39 | 0.6517 |
| 0.4 | 0.6554 |
| 0.41 | 0.6591 |
Either 0.38 or 0.39 is a good estimation of , and either value will lead you to an acceptable answer. Using other tools, a more accurate value can be found. I will show the work with a more accurate value.
We now convert the score into a score.
We can also visualize this with a spinner.
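In R, the two boundaries come from the left areas 0.35 and 0.65 (each tail has area 0.35):

```r
lower <- qnorm(0.35, mean = 600, sd = 130)  # about 549.9
upper <- qnorm(0.65, mean = 600, sd = 130)  # about 650.1
# check: the two tail areas together recover 0.7
pnorm(lower, 600, 130) + pnorm(upper, 600, 130, lower.tail = FALSE)
```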
This question provides , , , and . You will characterize the sampling distribution, calculate the standard score, determine the percentile rank (expressed as a decimal), and translate the score into a rating (see below).
When Archie practices archery, each arrow has the same probability distribution (see i.i.d). This means her skill is constant, there is no hot-hand effect, and there is no maturity of chances.
Over many months, Archie has shot ten thousand arrows (their positions are shown below as dots).
From those many arrows, Archie has determined an accurate probability distribution (of the points scored by an arrow).
Thus, Archie can determine her population mean and population standard deviation.
Each day, Archie shoots 48 arrows () and determines that day’s mean score. A mean of 10 would be a perfect day. We can treat each day’s mean as a random variable with its own probability distribution: a sampling distribution. We wish to characterize this sampling distribution. From the central limit theorem, we know the sampling distribution is approximately normal; however, we need to calculate the parameters.
The expected mean is simply the population mean.
“Expected mean” is a misnomer. It is not necessarily likely, or even possible, for a mean to equal the expected mean. However, if Archie repeatedly shot 48 arrows, we expect the means to have an average equal to the expected mean. So, maybe “average of means” would be better terminology.
This expected mean is the average of the sampling distribution.
Determine the expected mean:
(Round to the hundredths place)
The standard error of a mean is the quotient of the population standard deviation and the square root of the sample size. This standard error is the standard deviation of the sampling distribution.
Calculate the standard error:
(Round to the thousandths place)
Today, Archie’s mean score is 9.062 points ().
Archie would like to know how well she did today. She wants you to calculate a standard score (a score). To calculate the standard score of a sample mean, you can use the following formula.
Calculate the standard score:
(Round to the hundredths place)
Archie would like to know the probability that tomorrow she shoots worse than today. To estimate this, report the cumulative probability associated with the -score you calculated. This can be done with a table, pnorm function in R, NORM.DIST function in a spreadsheet, or with other standard normal tools.
Calculate , the cumulative probability:
(Round to the ten-thousandths place)
Archie would like a rating for her day’s performance. You decide to use the following scale; if z falls on a boundary, Archie is given the higher rating.
| interval | rating |
|---|---|
| to -1.5 | F |
| -1.5 to -0.5 | D |
| -0.5 to 0.5 | C |
| 0.5 to 1.5 | B |
| 1.5 to | A |
Determine Archie’s rating: A / B / C / D / F
The sampling distribution is approximately normal with mean 9.04 and standard error 0.15. The sampling distribution can be visualized, along with today’s mean (9.062) highlighted in red. The cumulative probability is the sum of the probabilities of scores less than (or equal to) 9.062. Thus, the area highlighted in blue represents the cumulative probability.
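Using R with the parameters stated above (expected mean 9.04, standard error 0.15):

```r
mu_xbar <- 9.04    # expected mean of the sampling distribution
SE      <- 0.15    # standard error of the mean
xbar    <- 9.062   # today's mean score

z <- (xbar - mu_xbar) / SE
round(z, 2)         # standard score; 0.15
round(pnorm(z), 4)  # cumulative probability; about 0.558
```

A z-score near 0.15 falls in the middle band of the rating scale, so the rating is C.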
This question provides , , , and . You will characterize the sampling distribution, calculate the standard score, determine the percentile rank (expressed as a decimal), and translate the score into a rating (see below).
When Archie practices archery, each arrow has the same probability distribution (see i.i.d). This means her skill is constant, there is no hot-hand effect, and there is no maturity of chances.
Over many months, Archie has shot ten thousand arrows (their positions are shown below as dots).
From those many arrows, Archie has determined an accurate probability distribution (of the points scored by an arrow).
Thus, Archie can determine her population mean and population standard deviation.
Each day, Archie shoots 108 arrows () and determines that day’s total score. A total of 1080 would be a perfect day. We can treat each day’s total as a random variable with its own probability distribution: a sampling distribution. We wish to characterize this sampling distribution. From the central limit theorem, we know the sampling distribution is approximately normal; however, we need to calculate the parameters.
The expected total is the product of the sample size and the population mean.
“Expected total” is a misnomer. It is not necessarily likely, or even possible, for a total to equal the expected total. However, if Archie repeatedly shot 108 arrows, we expect the totals to have a mean equal to the expected total. So, maybe “average of totals” would be better terminology.
This expected total is the average of the sampling distribution.
Calculate the expected total:
(Round to the hundredths place)
The standard error of a total is the product of the population standard deviation and the square root of the sample size. This standard error is the standard deviation of the sampling distribution.
Calculate the standard error:
(Round to the hundredths place)
Today, Archie’s total score is 896 points ().
Archie would like to know how well she did today. She wants you to calculate a standard score (a score). To calculate the standard score of a sample total, you can use the following formula.
Calculate the standard score:
(Round to the hundredths place)
Archie would like to know the probability that tomorrow she shoots worse than today. To estimate this, report the cumulative probability associated with the standard score you calculated. This can be done with a table, pnorm function in R, NORM.DIST function in a spreadsheet, or with other standard normal tools.
Calculate , the cumulative probability:
(Round to the ten-thousandths place)
Archie would like a rating for her day’s performance. You decide to use the following scale; if z falls on a boundary, Archie is given the higher rating.
| interval | rating |
|---|---|
| to -1.5 | F |
| -1.5 to -0.5 | D |
| -0.5 to 0.5 | C |
| 0.5 to 1.5 | B |
| 1.5 to | A |
Determine Archie’s rating: A / B / C / D / F
The sampling distribution is approximately normal with mean 856.44 and standard error 17.25. The sampling distribution can be visualized, along with today’s total (896) highlighted in red. The cumulative probability is the sum of the probabilities of scores less than (or equal to) 896. Thus, the area highlighted in blue represents the cumulative probability.
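Using R with the parameters stated above (expected total 856.44, standard error 17.25):

```r
mu_total <- 856.44   # expected total of the sampling distribution
SE       <- 17.25    # standard error of the total
total    <- 896      # today's total score

z <- (total - mu_total) / SE
round(z, 2)         # standard score; 2.29
round(pnorm(z), 4)  # cumulative probability; about 0.989
```

A z-score of about 2.29 is at least 1.5, so the rating is A.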
A farm produces 4 types of fruit: coconuts, lemons, mangos, and oranges. The fruits’ masses have population parameters dependent on the type of fruit. (All values are in grams.)
| _ Type of fruit _ | _ Population mean () _ | _ Population standard deviation () _ |
|---|---|---|
| coconuts | 673 | 59 |
| lemons | 129 | 15 |
| mangos | 179 | 18 |
| oranges | 223 | 15 |
A sample of each type is weighed. The results are shown below.
| _ Type of fruit _ | _ Sample size () _ | _ Sample mean () _ |
|---|---|---|
| coconut | 144 | 675.1 |
| lemon | 49 | 132 |
| mango | 121 | 177.4 |
| orange | 100 | 224 |
The population parameters and sample statistics can be downloaded as a csv.
For each sample, determine the mean’s standard score and the mean’s cumulative probability by assuming the requirements for the central limit theorem are met. Then determine which sample mean is most unusually large, most unusually small, most typically sized, and most unusually sized.
The standard score of a sample mean has, in the denominator, the population standard deviation divided by the square root of the sample size. Some people prefer to call this denominator the standard error, so the standard score (z-score) of a sample mean can also be expressed in terms of the standard error.
The formula to determine the z-score of a sample mean is the ratio with numerator the difference between sample mean and population mean and denominator the standard error of the mean.
The highest z-score (furthest right on the number line) corresponds to the most unusually large sample mean.
The smallest z-score (furthest left on the number line) corresponds to the most unusually small sample mean.
The smallest absolute z-score corresponds to the most typically sized sample mean.
The largest absolute z-score corresponds to the most unusually sized sample mean.
By using the formulas, you should get a spreadsheet like the one displayed here:
data = read.csv("fruit.csv", as.is=TRUE)
fruit = data$fruit
mu = data$mu
sigma = data$sigma
n = data$n
xbar = data$xbar
SEM = sigma/sqrt(n)
z = round( (xbar-mu)/SEM ,2)
cum_prob = round( pnorm(z) ,4)
data.frame(data,SEM,z,cum_prob,abs(z))
## fruit mu sigma n xbar SEM z cum_prob abs.z.
## 1 coconut 673 59 144 675.1 4.916667 0.43 0.6664 0.43
## 2 lemon 129 15 49 132.0 2.142857 1.40 0.9192 1.40
## 3 mango 179 18 121 177.4 1.636364 -0.98 0.1635 0.98
## 4 orange 223 15 100 224.0 1.500000 0.67 0.7486 0.67
# Most unusually large
fruit[z==max(z)]
## [1] "lemon"
# Most unusually small
fruit[z==min(z)]
## [1] "mango"
# Most typically sized
fruit[abs(z)==min(abs(z))]
## [1] "coconut"
# Most unusually sized
fruit[abs(z)==max(abs(z))]
## [1] "lemon"
The continuous random variable follows the distribution shown by the density curve and spinner below. It has a mean of and standard deviation of .
That spinner () will be fairly spun 729 times, and the sample mean of the spins will be recorded. Determine the probability that the random sample mean is less than 48.03.
Please approximate the random mean as a normal distribution with parameters suggested by the central limit theorem.
We use the central limit formulas for a random mean. So, we think the random mean is normally distributed with a mean of 48.51 and a standard deviation of 0.32.
Calculate a -score for the boundary.
You can round that to or . Any of the following probabilities will get credit.
Let’s draw a sketch.
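Using R with the sampling-distribution parameters above (mean 48.51, standard deviation 0.32):

```r
z <- (48.03 - 48.51) / 0.32            # z-score of the boundary; -1.5
pnorm(z)                               # P(sample mean < 48.03); about 0.0668
pnorm(48.03, mean = 48.51, sd = 0.32)  # same probability, without standardizing
```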
The continuous random variable follows the distribution shown by the density curve and spinner below. It has a mean of and standard deviation of .
That spinner () will be fairly spun 93 times, and the sample mean of the spins will be recorded. Determine the probability that the random sample mean is more than 90.49.
Please approximate the random mean as a normal distribution with parameters suggested by the central limit theorem.
We use the central limit formulas for a random mean. So, we think the random mean is normally distributed with a mean of 90.1 and a standard deviation of 0.3090116.
Calculate a -score for the boundary.
You can round that to or . Any of the following probabilities will get credit.
Let’s draw a sketch.
The continuous random variable follows the distribution shown by the density curve and spinner below. It has a mean of and standard deviation of .
That spinner () will be fairly spun 138 times, and the sample mean of the spins will be recorded. Determine the probability that the random sample mean is within units from .
Please approximate the random mean as a normal distribution with parameters suggested by the central limit theorem.
We use the central limit formulas for a random mean. So, we think the random mean is normally distributed with a mean of 34.84 and a standard deviation of 0.3439076.
Calculate a -score for the boundary.
You can round that to or . Any of the following probabilities will get credit.
Let’s draw a sketch.
The continuous random variable follows the distribution shown by the density curve and spinner below. It has a mean of and standard deviation of .
That spinner () will be fairly spun 79 times, and the sample mean of the spins will be recorded. Determine the probability that the random sample mean is farther than units from .
Please approximate the random mean as a normal distribution with parameters suggested by the central limit theorem.
We use the central limit formulas for a random mean. So, we think the random mean is normally distributed with a mean of 49.57 and a standard deviation of 0.2103914.
Calculate a -score for the boundary.
You can round that to or . Any of the following probabilities will get credit.
Let’s draw a sketch.
A fair 8-sided die (with sides numbered 1 through 8) will be rolled 64 times, and the sum (total) of those rolls will be recorded. What is the probability that the sum is less than 305.5?
To help you along, I will calculate the mean and standard deviation of single rolls. The formulas for an -sided die can be derived from the formulas for discrete uniform distribution.
Please use a normal approximation based on the central limit theorem.
We calculate the mean and standard deviation of the random sum using the formulas from the central limit theorem.
The central limit theorem tells us that the random sum is approximately normal with the parameters calculated above.
In other words, we can approximate the summing of 64 rolls of 8-sided dice with a single spin of the following spinner.
Find the appropriate z-score.
You can round this to either 0.95 or 0.96. I will continue with the unrounded z-score. We now rephrase the question as a standard normal probability.
You can use the table to find the probability.
We can sketch the density curve and shade the appropriate region.
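The whole approximation can be carried out in R; the discrete-uniform formulas give the mean and standard deviation of a single roll:

```r
n <- 64
mu_roll    <- (8 + 1) / 2            # mean of one roll of an 8-sided die: 4.5
sigma_roll <- sqrt((8^2 - 1) / 12)   # sd of one roll (discrete uniform); about 2.291

mu_sum    <- n * mu_roll             # mean of the sum: 288
sigma_sum <- sqrt(n) * sigma_roll    # sd of the sum; about 18.33

z <- (305.5 - mu_sum) / sigma_sum    # about 0.95
pnorm(305.5, mu_sum, sigma_sum)      # P(sum < 305.5); about 0.830
```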
The continuous random variable follows the distribution shown by the density curve and spinner below. It has a mean of and standard deviation of .
That spinner () will be fairly spun 184 times, and the total of the spins will be recorded. Determine the probability that the random total is more than 15400.
Please approximate the random total as a normal distribution with parameters suggested by the central limit theorem.
We use the central limit formulas for a random total. So, we think the random total is normally distributed with a mean of 15397.12 and a standard deviation of 77.7255016.
Calculate a -score for the boundary.
You can round that to or . Any of the following probabilities will get credit.
Let’s draw a sketch.
The continuous random variable follows the distribution shown by the density curve and spinner below. It has a mean of and standard deviation of .
That spinner () will be fairly spun 146 times, and the total of the spins will be recorded. Determine the probability that the random total is within 53.92 units from .
Please approximate the random total as a normal distribution with parameters suggested by the central limit theorem.
We use the central limit formulas for a random total. So, we think the random total is normally distributed with a mean of 8976.08 and a standard deviation of 88.3270661.
Calculate a -score for the boundary.
You can round that to or . Any of the following probabilities will get credit.
Let’s draw a sketch.
The continuous random variable follows the distribution shown by the density curve and spinner below. It has a mean of and standard deviation of .
That spinner () will be fairly spun 75 times, and the total of the spins will be recorded. Determine the probability that the random total is farther than 33.5 units from .
Please approximate the random total as a normal distribution with parameters suggested by the central limit theorem.
We use the central limit formulas for a random total. So, we think the random total is normally distributed with a mean of 2626.5 and a standard deviation of 29.2716586.
Calculate a -score for the boundary.
You can round that to or . Any of the following probabilities will get credit.
Let’s draw a sketch.
In some game, each trial has a probability of success. A player will attempt 190 trials. What is the probability that the number of successes is less than 27.5?
We determine the mean and standard deviation of the binomial distribution.
We determine a -score. (In the de Moivre-Laplace notes, I used to emphasize that a binomial variable is a sum of Bernoulli trials. Here, I will just use as the boundary for number of successes, because this is the more common notation.)
Then, find the standard normal probability.
Of course, you could have rounded , so the following will also get credit.
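In R, the normal approximation to a binomial count looks like the following. The success probability from the original problem is not shown here, so p = 0.15 below is only a stand-in value:

```r
n <- 190
p <- 0.15                         # hypothetical success probability (stand-in value)
mu_X    <- n * p                  # binomial mean
sigma_X <- sqrt(n * p * (1 - p))  # binomial standard deviation

z <- (27.5 - mu_X) / sigma_X
pnorm(z)          # normal approximation to P(X < 27.5)
pbinom(27, n, p)  # exact binomial probability, for comparison
```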
In some game, each trial has a probability of success. A player will attempt 90 trials. What is the probability that the number of successes is more than 52.5?
We determine the mean and standard deviation of the binomial distribution.
We determine a -score. (In the de Moivre-Laplace notes, I used to emphasize that a binomial variable is a sum of Bernoulli trials. Here, I will just use as the boundary for number of successes, because this is the more common notation.)
Then, find the standard normal probability.
Of course, you could have rounded , so the following will also get credit.
In some game, each trial has a probability of success. A player will attempt 195 trials. What is the probability that the number of successes is between 91.5 and 100.5?
We determine the mean and standard deviation of the binomial distribution.
We determine both -scores. (In the de Moivre-Laplace notes, I used to emphasize that a binomial variable is a sum of Bernoulli trials. Here, I will just use as the boundary for number of successes, because this is the more common notation.) We get the first -score.
We get the second -score.
Then, find the standard normal probability.
Of course, you could have rounded the -scores, so any of the following will also get credit.
In some game, each trial has a probability of success. A player will attempt 98 trials. What is the probability that the proportion of successes is less than 0.6276?
We determine the mean and standard deviation of the proportion sampling distribution.
We determine a -score. (The common notation is for a specific proportion.)
Then, find the standard normal probability.
Of course, you could have rounded , so the following will also get credit.
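In R, the proportion version uses the same pattern. The success probability from the original problem is not shown here, so p = 0.6 below is only a stand-in value:

```r
n <- 98
p <- 0.6                          # hypothetical success probability (stand-in value)
mu_phat <- p                      # mean of the proportion sampling distribution
se_phat <- sqrt(p * (1 - p) / n)  # its standard deviation (standard error)

z <- (0.6276 - mu_phat) / se_phat
pnorm(z)   # normal approximation to P(proportion < 0.6276)
```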
In some game, each trial has a probability of success. A player will attempt 135 trials. What is the probability that the proportion of successes is more than 0.9444?
We determine the mean and standard deviation of the proportion sampling distribution.
We determine a -score. (The common notation is for a specific proportion.)
Then, find the standard normal probability.
Of course, you could have rounded , so the following will also get credit.
In some game, each trial has a probability of success. A player will attempt 191 trials. What is the probability that the proportion of successes is between 0.7199 and 0.7356?
We determine the mean and standard deviation of the proportion sampling distribution.
We determine both -scores. We get the first -score.
We get the second -score.
Then, find the standard normal probability.
Of course, you could have rounded the -scores, so any of the following will also get credit.
The following questions use Z to refer to the standard normal variable. You will determine some probabilities and some boundaries.
To do this problem, you should practice the Standard Normal exercises.
Let random variable be normally distributed with mean and standard deviation .
Random variable has mean and standard deviation .
Let the interval of typical measurements be defined as having lower bound and upper bound .
Calculate the lower bound of the interval of typical measurements.
Calculate the upper bound of the interval of typical measurements.
Let the interval of typical totals be defined as having lower bound and upper bound .
For the listed values of , determine the bounds.
| _ lower bound of typical totals _ | _ upper bound of typical totals _ | |
|---|---|---|
| 25 | ||
| 100 | ||
| 400 |
Let the interval of typical averages be defined as having lower bound and upper bound .
For the listed values of , determine the bounds.
| _ lower bound of typical averages _ | _ upper bound of typical averages _ | |
|---|---|---|
| 25 | ||
| 100 | ||
| 400 |
| _ lower bound of typical totals _ | _ upper bound of typical totals _ | |
|---|---|---|
| 25 | 1335 | 1415 |
| 100 | 5420 | 5580 |
| 400 | 21840 | 22160 |
| _ lower bound of typical averages _ | _ upper bound of typical averages _ | |
|---|---|---|
| 25 | 53.4 | 56.6 |
| 100 | 54.2 | 55.8 |
| 400 | 54.6 | 55.4 |
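The answer tables above are consistent with a population mean of 55, a standard deviation of 4, and "typical" meaning within two standard deviations. A sketch in R (those two parameter values are inferred from the answers, not restated from the problem):

```r
mu <- 55; sigma <- 4   # values implied by the answer tables
n  <- c(25, 100, 400)

# typical totals: n*mu plus or minus 2*sigma*sqrt(n)
total_lo <- n * mu - 2 * sigma * sqrt(n)
total_hi <- n * mu + 2 * sigma * sqrt(n)

# typical averages: mu plus or minus 2*sigma/sqrt(n)
avg_lo <- mu - 2 * sigma / sqrt(n)
avg_hi <- mu + 2 * sigma / sqrt(n)

data.frame(n, total_lo, total_hi, avg_lo, avg_hi)
```

Note that the totals interval widens as n grows while the averages interval narrows; both follow from the square-root-of-n scaling.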
Random variable follows a Bernoulli distribution with . In this context, each spin is often called a trial. A “0” is a “fail” and a “1” is a “success”.
The average of a Bernoulli random variable is equal to . Determine .
The standard deviation of a Bernoulli random variable is equal to . Determine . (Round to thousandths place.)
The binomial distribution predicts how many successes there will be for a given number of trials. Let the interval of typical successes be defined as having lower bound and upper bound .
For the listed values of , determine the bounds. (Round to nearest integer.)
| _ lower bound of typical successes _ | _ upper bound of typical successes _ | |
|---|---|---|
| 25 | ||
| 100 | ||
| 400 |
Let the interval of typical proportions be defined as having lower bound and upper bound .
For the listed values of , determine the bounds. (Round to nearest hundredth.)
| _ lower bound of typical proportions _ | _ upper bound of typical proportions _ | |
|---|---|---|
| 25 | ||
| 100 | ||
| 400 |
| _ lower bound of typical successes _ | _ upper bound of typical successes _ | |
|---|---|---|
| 25 | 5 | 15 |
| 100 | 30 | 50 |
| 400 | 140 | 180 |
| _ lower bound of typical proportions _ | _ upper bound of typical proportions _ | |
|---|---|---|
| 25 | 0.2 | 0.6 |
| 100 | 0.3 | 0.5 |
| 400 | 0.35 | 0.45 |
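The answer tables above are consistent with p = 0.4. A sketch in R (that value is inferred from the answers, not restated from the problem):

```r
p <- 0.4   # success probability implied by the answer tables
n <- c(25, 100, 400)

sqrt(p * (1 - p))   # Bernoulli standard deviation; about 0.490

# typical successes: n*p plus or minus 2*sqrt(n*p*(1-p)), rounded to integers
succ_lo <- round(n * p - 2 * sqrt(n * p * (1 - p)))
succ_hi <- round(n * p + 2 * sqrt(n * p * (1 - p)))

# typical proportions: p plus or minus 2*sqrt(p*(1-p)/n), rounded to hundredths
prop_lo <- round(p - 2 * sqrt(p * (1 - p) / n), 2)
prop_hi <- round(p + 2 * sqrt(p * (1 - p) / n), 2)

data.frame(n, succ_lo, succ_hi, prop_lo, prop_hi)
```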
A farm produces 4 types of fruit: mangos, bananas, oranges, and plums. The fruits’ masses have population parameters dependent on the type of fruit. (All values are in grams.)
| _ Type of fruit _ | _ Population mean () _ | _ Population standard deviation () _ |
|---|---|---|
| mangos | 219 | 19 |
| bananas | 153 | 10 |
| oranges | 191 | 24 |
| plums | 64 | 6 |
A sample of each type is weighed. The results are shown below.
| _ Type of fruit _ | _ Sample size () _ | _ Sample mean () _ |
|---|---|---|
| mango | 81 | 221.6 |
| banana | 121 | 152.3 |
| orange | 100 | 188.6 |
| plum | 144 | 64.73 |
The population parameters and sample statistics can be downloaded as a csv.
For each sample, determine the mean’s standard score and the mean’s cumulative probability by assuming the requirements for the central limit theorem are met. Then determine which sample mean is most unusually large, most unusually small, most typically sized, and most unusually sized.
The standard score of a sample mean has, in the denominator, the population standard deviation divided by the square root of the sample size. Some people prefer to call this denominator the standard error, so the standard score (z-score) of a sample mean can also be expressed in terms of the standard error.
The formula to determine the z-score of a sample mean is the ratio with numerator the difference between sample mean and population mean and denominator the standard error of the mean.
The highest z-score (furthest right on the number line) corresponds to the most unusually large sample mean.
The smallest z-score (furthest left on the number line) corresponds to the most unusually small sample mean.
The smallest absolute z-score corresponds to the most typically sized sample mean.
The largest absolute z-score corresponds to the most unusually sized sample mean.
By using the formulas, you should get a spreadsheet like the one displayed here:
data = read.csv("fruit.csv", as.is=TRUE)
fruit = data$fruit
mu = data$mu
sigma = data$sigma
n = data$n
xbar = data$xbar
SEM = sigma/sqrt(n)
z = round( (xbar-mu)/SEM ,2)
cum_prob = round( pnorm(z) ,4)
data.frame(data,SEM,z,cum_prob,abs(z))
## fruit mu sigma n xbar SEM z cum_prob abs.z.
## 1 mango 219 19 81 221.60 2.1111111 1.23 0.8907 1.23
## 2 banana 153 10 121 152.30 0.9090909 -0.77 0.2206 0.77
## 3 orange 191 24 100 188.60 2.4000000 -1.00 0.1587 1.00
## 4 plum 64 6 144 64.73 0.5000000 1.46 0.9279 1.46
# Most unusually large
fruit[z==max(z)]
## [1] "plum"
# Most unusually small
fruit[z==min(z)]
## [1] "orange"
# Most typically sized
fruit[abs(z)==min(abs(z))]
## [1] "banana"
# Most unusually sized
fruit[abs(z)==max(abs(z))]
## [1] "plum"
This question provides μ, σ, n, and the sample total T. You will characterize the sampling distribution, calculate the standard score, determine the percentile rank (expressed as a decimal), and translate the score into a rating (see below).
When Archie practices archery, each arrow has the same probability distribution (see i.i.d). This means her skill is constant, there is no hot-hand effect, and there is no maturity of chances.
Over many months, Archie has shot ten thousand arrows (their positions are shown below as dots).
From those many arrows, Archie has determined an accurate probability distribution (of the points scored by an arrow).
Thus, Archie can determine her population mean and population standard deviation.
Each day, Archie shoots 48 arrows (n = 48) and determines that day’s total score. A total of 480 would be a perfect day. We can treat each day’s total as a random variable with its own probability distribution: a sampling distribution. We wish to characterize this sampling distribution. From the central limit theorem, we know the sampling distribution is approximately normal; however, we need to calculate the parameters.
The expected total is the product of the sample size and the population mean: E(T) = nμ.
“Expected total” is a misnomer. It is not necessarily likely, or even possible, for a total to equal the expected total. However, if Archie repeatedly shot 48 arrows, we expect the totals to have a mean equal to the expected total. So, maybe “average of totals” would be better terminology.
This expected total is the average of the sampling distribution.
Calculate the expected total:
(Round to the hundredths place)
The standard error of a total is the product of the population standard deviation and the square root of the sample size: SE = σ√n. This standard error is the standard deviation of the sampling distribution.
Calculate the standard error:
(Round to the hundredths place)
Today, Archie’s total score is 442 points (T = 442).
Archie would like to know how well she did today. She wants you to calculate a standard score (a z-score). To calculate the standard score of a sample total, you can use the following formula: z = (T − nμ) / (σ√n).
Calculate the standard score:
(Round to the hundredths place)
Archie would like to know the probability that tomorrow she shoots worse than today. To estimate this, report the cumulative probability associated with the standard score you calculated. This can be done with a table, pnorm function in R, NORM.DIST function in a spreadsheet, or with other standard normal tools.
Calculate Φ(z), the cumulative probability:
(Round to the ten-thousandths place)
Archie would like a rating for her day’s performance. You decide to use the following scale; if z falls on a boundary, Archie will be given the higher rating.
| interval | rating |
|---|---|
| -∞ to -1.5 | F |
| -1.5 to -0.5 | D |
| -0.5 to 0.5 | C |
| 0.5 to 1.5 | B |
| 1.5 to ∞ | A |
Determine Archie’s rating: A / B / C / D / F
The sampling distribution is approximately normal with mean 436.8 and standard error 6.93. The sampling distribution can be visualized, along with today’s total (442) highlighted in red. The cumulative probability is the sum of the probabilities of scores less than (or equal) 442. Thus, the area highlighted in blue represents the cumulative probability.
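Using the stated mean (436.8) and standard error (6.93) of the sampling distribution, today’s standard score, cumulative probability, and rating can be checked with a short R sketch:

```r
# z-score, cumulative probability, and rating for today's total of 442
ET = 436.8                     # expected total, n*mu
SE = 6.93                      # standard error of the total, sigma*sqrt(n)
total = 442
z = round((total - ET)/SE, 2)
cum_prob = round(pnorm(z), 4)  # probability of a worse (lower) total
z; cum_prob                    # z = 0.75 falls in the interval 0.5 to 1.5: rating B
```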
Random variable is normally distributed with mean and standard deviation .
In some game, each trial has the same chance of success, p. In other words, each trial outcome is a random Bernoulli variable with parameter p. To determine the following probabilities, use the de Moivre–Laplace theorem (normal approximation). I have already done the continuity correction by setting the boundaries.
You could first determine the mean and standard deviation of the Bernoulli variable.
You can use a t-table. Remember, the degrees of freedom (df or ν) is one less than n.
You can use T.DIST and T.INV to calculate the values. But remember these functions return/use LEFT-area probabilities.
You can use pt and qt to calculate the values. But remember these functions return/use LEFT-area probabilities.
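A quick R sketch of the left-area convention (the df value of 9 is just an illustrative value, not from the problem):

```r
# qt takes a LEFT-tail probability and returns the corresponding t-value
tstar = qt(0.975, df = 9)   # about 2.26
# pt goes the other way: the LEFT-tail area of a t-value
left = pt(tstar, df = 9)    # recovers 0.975
# a RIGHT-tail area is one minus the left-tail area
right = 1 - left
```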
A scientist has weighed 49 specimens of a newly discovered organism. Those weights have a sample mean of grams and a sample standard deviation of grams. The scientist hopes to construct a 95% confidence interval of the organism’s population mean (μ).
The scientist will later consult a statistician for a more precise method, but for now she will use a quick method to estimate the 95% confidence interval:
(You can round to nearest 0.1 grams.)
Plug the numbers into the expressions.
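The problem’s sample mean and standard deviation are elided above, so this R sketch of the quick method uses stand-in values (only n = 49 comes from the problem):

```r
# Quick 95% CI for a mean: xbar +/- 2*s/sqrt(n)
xbar = 150   # hypothetical sample mean (grams)
s    = 21    # hypothetical sample standard deviation (grams)
n    = 49    # sample size from the problem
ME = 2*s/sqrt(n)          # margin of error: two standard errors
c(xbar - ME, xbar + ME)   # lower and upper boundaries
```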
A scientist has grown 170 specimens under novel conditions and found that 12.94% of them survived (in other words, p̂ = 0.1294). The scientist hopes to construct a 95% confidence interval of the survival rate.
The scientist will later consult a statistician for a more precise method, but for now she will use a quick method to estimate the 95% confidence interval:
(You can round answers to the nearest thousandth.)
Plug the numbers into the expressions.
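Since n = 170 and p̂ = 0.1294 are both given, the quick method can be checked in R:

```r
# Quick 95% CI for a proportion: phat +/- 2*sqrt(phat*(1-phat)/n)
n    = 170
phat = 0.1294
ME = 2*sqrt(phat*(1 - phat)/n)     # margin of error: two standard errors
round(c(phat - ME, phat + ME), 3)  # boundaries, to the nearest thousandth
```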
In n trials, there were x successes, so the sample proportion is p̂ = x/n. You are tasked with determining the confidence interval (of the population proportion) with confidence level γ.
To do this, you first determine z* such that P(−z* ≤ Z ≤ z*) = γ. Then, evaluate p̂ ± z*√(p̂(1−p̂)/n) to determine the boundaries of the confidence interval.
(You can round answers to the hundredths place.)
First, determine z* such that P(−z* ≤ Z ≤ z*) = γ. It can help to draw a sketch.
The entire area under the density curve is 1. Thus, we can determine the areas of the tails (using symmetry).
Determine a leftward area with boundary z*.
We rephrase the puzzle. We wish to determine z* such that P(Z < z*) equals that leftward area. This is easy to determine using a z-table, qnorm in R, norm.inv in spreadsheets, or other methods for evaluating the quantile function.
## Using R's qnorm:
qnorm(0.91)
## [1] 1.340755
In the original phrasing, P(−1.34 ≤ Z ≤ 1.34) ≈ 0.82.
So, z* ≈ 1.34. Now, use the given expressions to evaluate the boundaries.
A population’s mean (μ) is unknown, but its standard deviation is known: σ = 28.3. A sample of size n = 216 is taken, and the sample mean is calculated: x̄ = 89.5. You are tasked with determining a confidence interval using a given confidence level: γ = 0.85.
To do this, you need to first determine z* such that P(−z* ≤ Z ≤ z*) = γ. Then, the boundaries are determined by evaluating x̄ ± z*σ/√n.
(You can round to the nearest hundredth and the boundaries to the nearest tenth.)
First, determine z* such that P(−z* ≤ Z ≤ z*) = 0.85. It can help to draw a sketch.
The entire area under the density curve is 1. Thus, we can determine the areas of the tails (using symmetry).
Determine a leftward area with boundary z*. It should be mentioned that the expression γ + (1−γ)/2 simplifies to (1+γ)/2.
We rephrase the puzzle. We wish to determine z* such that P(Z < z*) = 0.925. This is easy to determine using a z-table, qnorm in R, norm.inv in spreadsheets, or other methods for evaluating the quantile function.
In the original phrasing, P(−1.44 ≤ Z ≤ 1.44) ≈ 0.85.
So, z* ≈ 1.44. Now, use the given expressions.
A spreadsheet can use the confidence.norm function. It takes three arguments: alpha (one minus the confidence level), the population standard deviation, and the sample size.
The function returns the margin of error.
So, in this case, if in a spreadsheet you typed =confidence.norm(1-0.85,28.3,216), the result would be the margin of error, 2.7719202. You then add/subtract the margin of error to/from the sample mean.
You can do this with a spreadsheet.
xbar = 89.5
sigma = 28.3
n = 216
gamma = 0.85
zstar = qnorm(gamma+(1-gamma)/2)
ME = zstar*sigma/sqrt(n)
LB = xbar-ME
UB = xbar+ME
cat(sprintf("The lower bound: %.4f\nThe upper bound: %.4f",LB,UB))
## The lower bound: 86.7281
## The upper bound: 92.2719
A population’s mean (μ) and standard deviation (σ) are unknown, but the population is approximately normal. A sample of size n = 15 is taken, and the sample mean is calculated: x̄ = 28.6. The sample standard deviation is also calculated: s = 7.9. You are tasked with determining a confidence interval using a given confidence level: γ = 0.96.
To do this, you need to first determine t* such that P(−t* ≤ T ≤ t*) = γ. Then, the boundaries are determined by evaluating x̄ ± t*s/√n.
(You can round to the nearest hundredth and the boundaries to the nearest tenth.)
First, determine t* such that P(−t* ≤ T ≤ t*) = 0.96. It can help to draw a sketch.
The entire area under the density curve is 1. Thus, we can determine the areas of the tails (using symmetry).
Thus, we determine a leftward area. It should be mentioned that the expression γ + (1−γ)/2 simplifies to (1+γ)/2.
We rephrase the puzzle. We wish to determine t* such that P(T < t*) = 0.98. This is easy to determine using a t-table, qt in R, t.inv in spreadsheets, or other methods for evaluating the quantile function.
In the original phrasing, P(−2.26 ≤ T ≤ 2.26) ≈ 0.96.
So, t* ≈ 2.26. Now, use the given expressions.
A spreadsheet can use the confidence.t function. It takes three arguments: alpha (one minus the confidence level), the sample standard deviation, and the sample size.
The function returns the margin of error.
So, in this case, if in a spreadsheet you typed =confidence.t(1-0.96,7.9,15), the result would be the margin of error, 4.6175959. You then add/subtract the margin of error to/from the sample mean.
You can do this with a spreadsheet.
xbar = 28.6
s = 7.9
n = 15
gamma = 0.96
df = n-1 # df is the degrees of freedom
tstar = qt(0.5+gamma/2, df)
ME = tstar*s/sqrt(n)
LB = xbar-ME
UB = xbar+ME
cat(sprintf("The lower bound: %.4f\nThe upper bound: %.4f",LB,UB))
## The lower bound: 23.9824
## The upper bound: 33.2176
A sample was taken from a population. The measurements are shown below and can be downloaded as a csv.
563, 561, 666, 615, 622, 590, 527, 614, 612, 577, 625, 557, 606, 601, 554, 603, 628, 569, 578, 650, 568, 631
You are tasked with determining the confidence interval (of the population mean) with a confidence level of γ = 0.96.
You can download a solution spreadsheet. The top 13 rows are shown below:
x = c(563, 561, 666, 615, 622, 590, 527, 614, 612, 577, 625, 557, 606, 601, 554, 603, 628, 569, 578, 650, 568, 631)
n = length(x)
xbar = mean(x)
s = sd(x)
SE = s/sqrt(n)
gamma = 0.96 # Probability T is between -tstar and tstar
cumulative = gamma+(1-gamma)/2 # Probability T is less than tstar
tstar = qt(cumulative, n-1)
ME = tstar*s/sqrt(n)
LB = xbar-ME
UB = xbar+ME
print(data.frame(n,xbar,s,SE,tstar,ME,LB,UB,row.names=""))
##  n     xbar        s       SE    tstar      ME       LB       UB
## 22 596.2273 34.70273 7.398646 2.189427 16.1988 580.0285 612.4261
When Archie practices archery, she records the horizontal position (x) and vertical position (y) of every arrow (in millimeters), using the bullseye as the origin.
From years of shooting, Archie has determined that x and y are roughly bell-shaped with known population standard deviations σx and σy (in mm). However, Archie has a new sight, so her current population means (μx and μy) are unknown. (She hopes both are zero.)
Archie wants to get confidence intervals for μx and μy, using a sample size n and a confidence level γ = 0.98. Whether or not a confidence interval straddles 0 will determine whether Archie adjusts that aspect of her sight.
Determine the boundary z* such that P(−z* ≤ Z ≤ z*) = 0.98:
The margin of error represents how much variation we expect in sample means. Calculate MEx, the margin of error when sampling x:
Calculate MEy, the margin of error when sampling y:
Archie shoots arrows.
The exact positions can be downloaded as a csv.
Calculate x̄, the horizontal sample mean. Calculate ȳ, the vertical sample mean.
Calculate the 98% confidence interval of μx by using x̄ ± MEx.
Calculate the lower boundary:
Calculate the upper boundary:
Calculate the 98% confidence interval of μy by using ȳ ± MEy.
Calculate the lower boundary:
Calculate the upper boundary:
If a confidence interval straddles 0, Archie will leave that aspect alone. If a confidence interval does not straddle 0, Archie will adjust that aspect on the sight.
Does Archie adjust the horizontal aspect of her sight? Yes / No
Does Archie adjust the vertical aspect of her sight? Yes / No
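The arrow data and the sigma values live in the elided parts of the problem, so the following R sketch uses made-up stand-ins just to show the shape of the calculation:

```r
# 98% confidence intervals for mu_x and mu_y (all numbers are stand-ins;
# use the csv data and the given sigmas in the real problem)
x = c(-24, 13, 39, -41, 28)    # hypothetical horizontal positions (mm)
y = c(10, -5, 22, -17, 3)      # hypothetical vertical positions (mm)
sigma_x = 40; sigma_y = 30     # stand-ins for the known population sigmas
n = length(x)
zstar = qnorm(0.5 + 0.98/2)    # about 2.33
ME_x = zstar*sigma_x/sqrt(n)
ME_y = zstar*sigma_y/sqrt(n)
c(mean(x) - ME_x, mean(x) + ME_x)  # if this straddles 0, leave that aspect alone
c(mean(y) - ME_y, mean(y) + ME_y)
```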
A basketball player has decided to estimate her probability of scoring a free throw. To do this, she shoots free throws. If she scores, she records a “1”. If she misses, she records a “0”.
The results are shown below and can be downloaded as a csv.
1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
You are asked to determine a confidence interval using a given confidence level: γ = 0.96. To do this, you determine z* such that P(−z* ≤ Z ≤ z*) = γ. You also calculate the sample size and sample proportion. Then, you use the following formulas: LB = p̂ − z*√(p̂(1−p̂)/n) and UB = p̂ + z*√(p̂(1−p̂)/n).
You should be able to determine that 0.5 + γ/2 = 0.98, so z* ≈ 2.05. Using a spreadsheet or R, you should determine that n = 140 and p̂ ≈ 0.7643. Then, use the given formulas.
gamma = 0.96
zstar = qnorm(0.5+gamma/2)
x = c(1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
n = length(x)
phat = mean(x)
LB = phat-zstar*sqrt(phat*(1-phat)/n)
UB = phat+zstar*sqrt(phat*(1-phat)/n)
print(data.frame(LB,UB,row.names=""))
## LB UB
## 0.6906134 0.837958
You can download a solution spreadsheet.
The first 10 lines are shown here:
We can approximate a 95% confidence interval (where the 95% refers to the confidence level: how frequently these intervals straddle the population mean) by using x̄ ± 2s/√n, where x̄ is the sample mean, s is the sample standard deviation, and n is the sample size. The 2 comes from the fact that about 95% of normal measurements land within 2 standard deviations of the mean.
The quantity that we subtract from or add to the sample mean is called the margin of error.
When using a confidence level of 0.95, knowing s will be approximately 65, and wanting the margin of error to be approximately 2.2, how large does the sample size need to be?
You can round your answer to two significant digits.
Do some algebra with ME = 2s/√n. Multiply both sides by √n. Divide both sides by ME. Square both sides. Plug in numbers. Evaluate.
The tolerance allows rounding to 2 significant digits, giving n ≈ 3500.
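The algebra, solving ME = 2s/√n for n, can be checked in R with the stated values:

```r
# Solve ME = 2*s/sqrt(n) for n: n = (2*s/ME)^2
s  = 65
ME = 2.2
n = (2*s/ME)^2
signif(n, 2)   # round to two significant digits: 3500
```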
We can approximate a 95% confidence interval of a proportion (where the 95% refers to how frequently these intervals straddle the population proportion) by using p̂ ± 2√(p̂(1−p̂)/n), where p̂ is the sample proportion and n is the sample size. The 2 comes from the fact that about 95% of normal measurements land within 2 standard deviations of the mean.
The quantity that we subtract from or add to the sample proportion is called the margin of error.
If we know p̂ will be approximately 0.57, and we want the margin of error to be approximately 0.0043, then how large does the sample size need to be?
You can round your answer to two significant digits.
Do some algebra with ME = 2√(p̂(1−p̂)/n). Square both sides. Multiply both sides by n. Divide both sides by ME². Simplify.
Plug in numbers. Evaluate.
The tolerance allows rounding to 2 significant digits, giving n ≈ 53000.
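The algebra, solving ME = 2√(p̂(1−p̂)/n) for n, can be checked in R with the stated values:

```r
# Solve ME = 2*sqrt(p*(1-p)/n) for n: n = 4*p*(1-p)/ME^2
p  = 0.57
ME = 0.0043
n = 4*p*(1 - p)/ME^2
signif(n, 2)   # round to two significant digits: 53000
```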
A scientist is investigating whether a chemical may affect the growth of an organism. Under the control conditions (no chemical), the organism grows to a mean mass of grams with a standard deviation of grams. These values are known precisely because the organism has been grown under control conditions many, many times.
The scientist has only grown the organism under experimental conditions (with chemical) times. In that sample, the masses have a mean .
The scientist wonders if this sample mean is significantly different from the control mean. To investigate this, the scientist will determine the p-value. The p-value represents the probability of getting a sample mean as far (or farther) from the control mean due to chance alone.
It is common to compare the p-value to 0.05.
We need to calculate the p-value.
Of course, you could have rounded the z-score. Either of the following will also get credit.
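The problem’s specific numbers are elided above, so this R sketch of the two-tail p-value computation uses stand-in values:

```r
# Two-tail p-value for a sample mean under the null (control) model
# mu0, sigma, n, xbar are stand-ins for the problem's elided values
mu0   = 10     # control mean
sigma = 2      # control standard deviation
n     = 25
xbar  = 10.9
z = (xbar - mu0)/(sigma/sqrt(n))   # standard score of the sample mean
pvalue = 2*(1 - pnorm(abs(z)))     # two tails: double the outer area
```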
A scientist is investigating whether a chemical may affect the survival rate of an organism. Under the control conditions (no chemical), the organism has a survival rate of . This value is known precisely because the organism has been grown under control conditions many, many times.
The scientist has only grown the organism under experimental conditions (with chemical) times. In that sample, the survival rate is .
The scientist wonders if this survival rate is significantly different from the control rate. To investigate this, the scientist will determine the p-value. The p-value represents the probability of getting a sample proportion as far (or farther) from the control rate due to chance alone.
It is common to compare the p-value to 0.05.
We need to calculate the p-value.
Of course, you could have rounded the z-score. Either of the following will also get credit.
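Again, the specific numbers are elided, so here is a hedged R sketch of the proportion version with stand-in values:

```r
# Two-tail p-value for a sample proportion under the null (control) model
# p0, n, phat are stand-ins for the problem's elided values
p0   = 0.5
n    = 100
phat = 0.61
z = (phat - p0)/sqrt(p0*(1 - p0)/n)  # standard score of the proportion
pvalue = 2*(1 - pnorm(abs(z)))       # two tails: double the outer area
```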
Archie worries the horizontal aspect of her sight may be off. She decides to run a one-sample t test on the x values of her next 50 arrows. She will use a significance level of α = 0.05. If the result has statistical significance, Archie will adjust her sight.
The horizontal positions are shown below and can be downloaded as a csv.
-24, 13, 39, 50, 13, -29, 50, -41, 28, 110, 36, -10, 99, -108, -43, 28, -9, 23, -99, -61, -56, 3, 79, -38, 52, -100, -174, -38, -36, -77, 89, -21, -138, -20, -50, -93, -58, 12, -74, 65, -33, 47, -102, 121, -16, 9, -73, 37, -9, -136
To get the p-value:
In this situation, the null population mean is zero. In other words, the null hypothesis claims the sight is correct and the difference between x̄ and 0 is just due to chance. The alternative hypothesis claims the sight needs to be adjusted: the difference between x̄ and 0 is partly because the sight is off.
Determine the p-value.
Archie will adjust her sight if the p-value is less than 0.05, because this indicates statistical significance. Does Archie adjust her sight? Yes. / No.
Determine the sample statistics (x̄ and s). I would recommend using a spreadsheet or R (or another computer-based method). Evaluate the t statistic.
Restate the t-value. Remember, the degrees of freedom is one less than the sample size.
At this point, there are various ways to determine the p-value. The least accurate way is to use the t-table. Go to the row with df = 49.
In the row with df = 49, we can see that our calculated |t|, 1.61, is between 1.3 and 1.68. Thus, we know the p-value is between 0.1 and 0.2. You could get pretty close by using linear interpolation.
The more accurate p-value is 0.1132777. This can be calculated using a computer; for example, you can use this web app (for full accuracy, you’ll need the more precise value of t). You can also use a spreadsheet or R.
You can do this problem VERY quickly with R.
x = c(-24,13,39,50,13,-29,50,-41,28,110,36,-10,99,-108,-43,28,-9,23,-99,-61,-56,3,79,-38,52,-100,-174,-38,-36,-77,89,-21,-138,-20,-50,-93,-58,12,-74,65,-33,47,-102,121,-16,9,-73,37,-9,-136)
t.test(x)
##
## One Sample t-test
##
## data: x
## t = -1.6125, df = 49, p-value = 0.1133
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -34.277829 3.757829
## sample estimates:
## mean of x
## -15.26
Connie suspects a coin may be unfair when spun on its edge on a table. She decides to record some spins, using “0” for tails and “1” for heads. After those spins, she will run a one-proportion hypothesis test using a significance level α = 0.05.
The data is shown below, and can be downloaded as a csv.
1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0
The null hypothesis states the coin is fair and any deviation between the sample proportion and 0.5 is merely due to chance and natural variation. Thus, the null population proportion is .
The alternative hypothesis states the coin is unfair, so the deviation between the sample proportion and 0.5 is at least partly due to the unfairness of the coin.
The p-value will indicate the probability that a fair coin produces a sample proportion as extreme (or more extreme) in its deviation from 0.5. An approximate formula, using a normal approximation of the sampling distribution of the proportion and not using a continuity correction, is given:
If you make a continuity correction, your p-value is more accurate (and larger, so more conservative).
And, if you want to be exactly correct, you need a computer (or a lot of time) to use the binomial distribution formulas.
For this problem, you can use any of those three strategies.
Determine the p-value.
Connie compares the p-value to the significance level, α = 0.05. If the p-value is less than 0.05, Connie concludes the coin is unfair. Otherwise, Connie will conclude the coin MIGHT be fair but future measurements may still show the coin is unfair.
Does Connie conclude the coin is unfair?
Yes, the sample proportion is significantly far from 0.5, so Connie thinks the coin is unfair. / No, the sample proportion is NOT significantly far from 0.5, so Connie retains the belief that the coin MIGHT be fair.
First, I apologize for the notation, but for some reason this notation is ubiquitous. We have 4 different “p” variables: the population proportion p, the null proportion p0, the sample proportion p̂, and the p-value.
You need to determine the sample size (n) and sample proportion (p̂).
You then calculate a z-score. For simplicity, we will not make the continuity correction if we are doing this by hand.
Restate the z-value.
You can use a table to determine the cumulative probability of z.
To calculate the p-value, you need to remember how to determine a two-tail probability from a cumulative (leftward) probability: p = 2 × (1 − Φ(|z|)).
You can use any of the three methods shown in the spreadsheet:
You can download a solution spreadsheet.
All three methods can be done easily in R:
x = c(1,0,1,0,1,1,1,0,1,1,0,1,1,0,0,0,1,1,1,0,0,1,1,0,1,1,1,0,1,1,0,0,1,0,1,0,1,0,0,1,0,1,1,1,0,0,0,1,0,0,0,0,0,0,1,1,0,0,1,1,0,0,0,1,1,1,0,1,0,1,0,1,1,0,1,1,1,1,1,0,0,0,0,1,0,1,1,1,1,1,1,0,0,1,1,0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,0,1,1,1,1,1,1,0,0,1,0,1,0,1,0,0,0,0,0,1,1,0,0,0,1,1,0,1,1,1,0,0,1,0,0,1,0,0,1,0,1,1,1,1,1,0,0,0,1,1,0,0,1,1,1,1,0,1,1,0,0,1,0,1,0,0)
prop.test(sum(x),length(x),0.5,correct=F)
##
## 1-sample proportions test without continuity correction
##
## data: sum(x) out of length(x), null probability 0.5
## X-squared = 2.4915, df = 1, p-value = 0.1145
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4856922 0.6304316
## sample estimates:
## p
## 0.559322
prop.test(sum(x),length(x),0.5)
##
## 1-sample proportions test with continuity correction
##
## data: sum(x) out of length(x), null probability 0.5
## X-squared = 2.2599, df = 1, p-value = 0.1328
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4828804 0.6331471
## sample estimates:
## p
## 0.559322
binom.test(sum(x),length(x),0.5)
##
## Exact binomial test
##
## data: sum(x) and length(x)
## number of successes = 99, number of trials = 177, p-value =
## 0.1325
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.4828872 0.6337364
## sample estimates:
## probability of success
## 0.559322
A study asked individuals to time a mile run (in seconds). After a month, the same individuals timed another mile run. You are asked to perform a paired-data test to investigate whether fitness changed.
| name | x1 | x2 |
|---|---|---|
| Charlotte | 654 | 641 |
| Emani | 751 | 717 |
| Hudson | 305 | 253 |
| Jaelani | 305 | 229 |
| Jayden | 945 | 919 |
| Julianne | 499 | 557 |
| Lea | 362 | 285 |
| Luke | 608 | 672 |
| Ruby | 485 | 402 |
| Zaina | 499 | 415 |
To do this, first determine the list of differences. For each individual, determine their difference d = x2 − x1.
Then, determine the mean (d̄) and standard deviation (sd) of the differences. The t-score is then calculated to determine the p-value.
The null hypothesis predicts there is no change in fitness, so μd = 0. The alternative hypothesis predicts a change in fitness. The degrees of freedom is one less than the number of individuals.
Calculate the p-value.
Using a significance level of α = 0.05, is there a significant change in run times?
Yes / No
You first need to determine a list of differences.
| name | x1 | x2 | d = x2-x1 |
|---|---|---|---|
| Charlotte | 654 | 641 | -13 |
| Emani | 751 | 717 | -34 |
| Hudson | 305 | 253 | -52 |
| Jaelani | 305 | 229 | -76 |
| Jayden | 945 | 919 | -26 |
| Julianne | 499 | 557 | 58 |
| Lea | 362 | 285 | -77 |
| Luke | 608 | 672 | 64 |
| Ruby | 485 | 402 | -83 |
| Zaina | 499 | 415 | -84 |
Determine the sample size and degrees of freedom. Determine the sample mean of the differences. Determine the standard deviation of the differences. (You probably do not want to do this by hand.) Calculate the t-score. Then determine the p-value. I would recommend using a computer program, like a spreadsheet or R, to determine this p-value.
The p-value is more than 0.05, so the result is NOT significant.
The solution spreadsheet can be downloaded as a csv. The first 10 rows are shown below.
If your spreadsheet does not have the TDIST function, you can try T.DIST.2T(G6,G3,1). Also, notice that T.TEST does everything for you, so you can just use that.
Make/use a directory (folder) for this problem (paired-data test), and set the working directory accordingly. Save run_times.csv to your working directory. I would recommend saving the following script as paired_data_hypothesis_test.r, in the same directory.
table = read.csv("run_times.csv")
x1 = table[['x1']]
x2 = table[['x2']]
t.test(x1,x2,paired=T)
##
## Paired t-test
##
## data: x1 and x2
## t = 1.8518, df = 9, p-value = 0.09707
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.157984 71.757984
## sample estimates:
## mean of the differences
## 32.3
Or, if you wanted to do it the long way:
table = read.csv("run_times.csv")
x1 = table[["x1"]]
x2 = table[["x2"]]
d = x2-x1
n = length(d)
t = abs(mean(d))/(sd(d)/sqrt(n))
cumulative = pt(t,n-1)
pvalue = 2*(1-cumulative)
pvalue
## [1] 0.09707493
A doctor runs a controlled experiment. The participants are randomly assigned to two groups: control and treatment. The participants in the control group are given a placebo. The participants in the treatment group are given a drug.
After a month, each participant’s triglyceride level (in mg/dL) is measured. Those measurements are shown below. They can also be downloaded as a csv.
## Control: 262.8, 276.4, 243, 237.9, 191.4, 264.4, 202.7, 235.1, 299.3, 300.3, 243.3, 316.9, 249, 332.3, 243.9, 222.5, 203.5, 290.9, 263.4, 271, 212.5
##
## Treatment: 372.5, 247.7, 278.5, 279.2, 313.1, 242.6, 235.5, 259.4, 270.9, 208.1, 219.2, 245.3, 271.1, 267.7, 338.5, 310.1, 291.8, 243.6, 289.1, 294.4, 289.9, 285.8
You are asked to perform a two-tail two-sample Welch’s test to determine whether there is a significant difference of means in the two samples.
To do this by hand, you would first determine the absolute t-score as defined here. You’d also need to calculate the degrees of freedom.
And then, the p-value:
However, this problem is easy when using a spreadsheet or R, so I would recommend using one of those tools.
In a spreadsheet, you can use T.TEST with mode=2 for a two-tail test and type=3 for Welch’s test.
In R, you can use t.test with the default settings.
Determine the p-value.
Is the difference of means significant (using a significance level of 0.05)?
Yes, the drug causes a difference in average triglyceride level. / No, we don’t know whether the drug causes a difference.
To do this by hand, you first calculate the sample statistics. You will get quite close if you round df down (floor).
Then, using a computer application or a table, you should be able to determine the following probabilities (using interpolation to estimate if using table).
You just need to use T.TEST with the proper settings. You can download the solution as a csv.
You just use t.test with the default settings. The hardest part is getting the data imported. You can do this in 2 ways: copy/paste or read.csv.
x1 = c(262.8,276.4,243,237.9,191.4,264.4,202.7,235.1,299.3,300.3,243.3,316.9,249,332.3,243.9,222.5,203.5,290.9,263.4,271,212.5)
x2 = c(372.5,247.7,278.5,279.2,313.1,242.6,235.5,259.4,270.9,208.1,219.2,245.3,271.1,267.7,338.5,310.1,291.8,243.6,289.1,294.4,289.9,285.8)
t.test(x1,x2)
##
## Welch Two Sample t-test
##
## data: x1 and x2
## t = -1.6977, df = 40.884, p-value = 0.09717
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -43.409861 3.760511
## sample estimates:
## mean of x mean of y
## 255.3571 275.1818
Download triglyceride.csv and move it to a directory (folder). Make a script, welchttest.r, and put it in the same directory. Set the working directory to this directory. Then, run the script.
### welchttest.r
data = read.csv("triglyceride.csv")
x1 = data$x1
x2 = data$x2
t.test(x1,x2)
##
## Welch Two Sample t-test
##
## data: x1 and x2
## t = -1.6977, df = 40.884, p-value = 0.09717
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -43.409861 3.760511
## sample estimates:
## mean of x mean of y
## 255.3571 275.1818
A doctor runs a controlled experiment. The sick patients are randomly assigned to two groups: control and treatment. The patients in the control group are given a placebo. The patients in the treatment group are given a drug.
After a month, each patient was checked for whether they recovered from the sickness. A “0” means no recovery while a “1” means recovery. This data can also be downloaded as a csv.
## Control: 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1
##
## Treatment: 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0
You are asked to perform a two-tailed two-proportion test (which is equivalent to a 2x2 chi-squared test). You will get credit whether or not you apply a continuity correction, but please pool the data for the standard-error estimation.
Just for completeness, you would also get credit for using Fisher’s exact test.
Determine the p-value.
Is the difference of proportions significant (using a significance level of 0.05)?
Yes, the drug causes a difference in recovery. / No, we don’t know whether the drug causes a difference.
There are many ways to do this problem. I will show the following:
- the chisq.test function on a 2x2 contingency table
- R’s prop.test and fisher.test functions

First, somehow you need to determine the sample sizes (n1 and n2) and the sample totals (numbers of recoveries), ns1 and ns2. You could count… but a computer is probably helpful.
It can be helpful to organize these summary statistics into a contingency table.
| Recover | Not_recover | TOTAL | |
|---|---|---|---|
| Control | 53 | 23 | 76 |
| Treatment | 31 | 23 | 57 |
| TOTAL | 84 | 46 | 133 |
Calculate the proportions.
Determine the absolute z-score.
Using a table or an online standard-normal probability tool, determine the appropriate probabilities.
This p-value did not use the continuity correction.
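Using the counts from the contingency table above, the pooled z computation (no continuity correction) can be written out in R; it reproduces the prop.test(correct=F) result further below:

```r
# Pooled two-proportion z test (no continuity correction)
ns1 = 53; n1 = 76   # control: recoveries, sample size
ns2 = 31; n2 = 57   # treatment: recoveries, sample size
p1 = ns1/n1; p2 = ns2/n2
ppool = (ns1 + ns2)/(n1 + n2)               # pooled proportion, 84/133
SE = sqrt(ppool*(1 - ppool)*(1/n1 + 1/n2))  # pooled standard error
z = (p1 - p2)/SE
pvalue = 2*(1 - pnorm(abs(z)))              # two-tail p-value
```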
Solution download.
If you wanted to make Yates’ continuity correction, you would need to calculate the EXPECTED table the same, but then make another table, where the observed values are all 0.5 closer to the expected values. Then, you would use =CHISQ.TEST on these new values (as the “observed”) and the expected values.
It looks challenging to do a Fisher exact test in a spreadsheet.
x1 = c(1,1,0,1,1,1,1,0,1,1,0,1,0,0,1,0,1,0,1,1,1,1,1,0,1,0,1,0,1,1,0,1,1,1,1,1,1,1,1,0,0,0,1,0,1,1,1,1,0,1,1,0,1,0,0,1,1,1,1,1,1,1,1,1,0,1,0,1,0,1,1,0,1,1,1,1)
x2 = c(0,0,0,0,0,0,1,1,1,1,1,1,0,1,0,1,1,0,0,1,1,1,0,1,1,0,0,1,1,1,1,0,0,1,1,0,0,0,0,1,1,0,1,1,1,1,0,1,0,1,1,0,1,0,1,0,0)
n1 = length(x1)
n2 = length(x2)
ns1 = sum(x1) #number of successes in sample 1
ns2 = sum(x2)
prop.test(c(ns1,ns2),c(n1,n2))
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(ns1, ns2) out of c(n1, n2)
## X-squared = 2.6719, df = 1, p-value = 0.1021
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.02733009 0.33434763
## sample estimates:
## prop 1 prop 2
## 0.6973684 0.5438596
prop.test(c(ns1,ns2),c(n1,n2),correct=F)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(ns1, ns2) out of c(n1, n2)
## X-squared = 3.2986, df = 1, p-value = 0.06934
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01197921 0.31899676
## sample estimates:
## prop 1 prop 2
## 0.6973684 0.5438596
nf1 = n1-ns1
nf2 = n2-ns2
conttab = matrix(c(ns1,nf1,ns2,nf2),nrow=2)
fisher.test(conttab)
##
## Fisher's Exact Test for Count Data
##
## data: conttab
## p-value = 0.1016
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8892012 4.2017117
## sample estimates:
## odds ratio
## 1.922894
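For completeness, the chisq.test route listed above can be sketched directly from the summary counts (a sketch; `counts` is just an illustrative name):

```r
# 2x2 table from the summary counts (column-major: Recover column, then Not_recover)
counts = matrix(c(53, 31, 23, 23), nrow = 2,
                dimnames = list(c("Control","Treatment"), c("Recover","Not_recover")))
chisq.test(counts)                   # with Yates' continuity correction
chisq.test(counts, correct = FALSE)  # without the correction
```

These two calls agree with the prop.test results above (with and without the correction, respectively).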
An automatic bottle filler is supposed to average 300.00 ml of fluid in each bottle. You sampled some random bottles, recording their volumes:
298.19, 298.05, 299.35, 297.07, 297.82
You are asked to determine a 95% confidence interval, calculate an appropriate p-value (using a two-tailed t-test), and state whether the filler needs adjustment, using a significance level of 0.05.
Determine the lower boundary of the confidence interval.
Determine the upper boundary of the confidence interval.
Determine the p-value.
Does the filler need adjustment?
Yes / No
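In R, one way to work this problem is a one-sample t-test against the target mean (a sketch):

```r
# one-sample t-test of the bottle volumes against the 300.00 ml target
x = c(298.19, 298.05, 299.35, 297.07, 297.82)
t.test(x, mu = 300, conf.level = 0.95)   # reports the CI and the two-tailed p-value
```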
A scratch-off lottery has a stated chance of 0.63 to win. You sampled some tickets, marking a win as “1” and a loss as “0”.
1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
1, 0, 1, 1
Please determine a 95% confidence interval, calculate an appropriate p-value, and state whether the sample proportion is significantly different from the stated chance (using a significance level of 0.05).
For the confidence interval, you can use a normal approximation interval, a Wilson score interval, a Wilson score interval with continuity correction, or the exact Clopper-Pearson interval (see descriptions here).
For the p-value, you can similarly use a z-test, a chi-squared test (with or without continuity correction), or an exact test.
Determine the lower boundary of the confidence interval.
Determine the upper boundary of the confidence interval.
Determine the p-value.
Using a significance level of 0.05, is the sample proportion significantly different from the stated chance?
Yes / No
By hand, this is easiest to do with a normal approximation without the continuity correction.
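That approach can be sketched in R (normal approximation, no continuity correction; the data vector is the sample above):

```r
# one-proportion z-test against the stated chance of 0.63
x = c(1,1,1,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,
      1,0,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,1,1,
      1,1,0,0,1,1,0,1,0,1,1,1,1,1,0,1,1,1,0,1,
      1,0,1,1)
n    = length(x)
phat = mean(x)                               # sample proportion of wins
p0   = 0.63                                  # stated chance
z      = (phat - p0)/sqrt(p0*(1 - p0)/n)     # null SE in the denominator
pvalue = 2*pnorm(-abs(z))
ci     = phat + c(-1, 1)*qnorm(0.975)*sqrt(phat*(1 - phat)/n)  # normal-approximation interval
```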
A study asked individuals to time a mile run (in seconds). After a month, the same individuals timed another mile run. You are asked to perform a paired-data test to investigate whether fitness changed.
| name | x1 | x2 |
|---|---|---|
| Angel | 490 | 441 |
| Italia | 316 | 252 |
| June | 337 | 283 |
| Kalani | 378 | 362 |
| Luke | 451 | 408 |
| Maxwell | 623 | 593 |
| Sarita | 703 | 657 |
| Serenity | 382 | 390 |
| Theodore | 279 | 252 |
| Zara | 431 | 469 |
Please run a paired t-test to check whether there was a significant change in running times.
Calculate the p-value.
Using a significance level of 0.05, is there a significant change in run times?
Yes / No
You first need to determine a list of differences.
| name | x1 | x2 | d = x2-x1 |
|---|---|---|---|
| Angel | 490 | 441 | -49 |
| Italia | 316 | 252 | -64 |
| June | 337 | 283 | -54 |
| Kalani | 378 | 362 | -16 |
| Luke | 451 | 408 | -43 |
| Maxwell | 623 | 593 | -30 |
| Sarita | 703 | 657 | -46 |
| Serenity | 382 | 390 | 8 |
| Theodore | 279 | 252 | -27 |
| Zara | 431 | 469 | 38 |
Determine the sample size and degrees of freedom. Determine the sample mean of the differences. Determine the standard deviation of the differences; you probably do not want to do this by hand. Calculate the t-score. Then determine the p-value; I would recommend using a computer program, like a spreadsheet or R.
The p-value is less than 0.05, so the result is significant.
The solution spreadsheet can be downloaded as a csv. The first 10 rows are shown below.
If your spreadsheet does not have the TDIST function, you can try T.DIST.2T(G6,G3). Also, notice that T.TEST does everything for you, so you can just use that.
Make/use a directory (folder) for this problem (paired-data test), and set the working directory accordingly. Save run_times.csv to your working directory. I would recommend saving the following script as paired_data_hypothesis_test.r, in the same directory.
table = read.csv("run_times.csv")
x1 = table[['x1']]
x2 = table[['x2']]
t.test(x1,x2,paired=T)
##
## Paired t-test
##
## data: x1 and x2
## t = 2.8682, df = 9, p-value = 0.01853
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.979412 50.620588
## sample estimates:
## mean of the differences
## 28.3
Or, if you wanted to do it the long way:
table = read.csv("run_times.csv")
x1 = table[["x1"]]
x2 = table[["x2"]]
d = x2-x1
n = length(d)
t = abs(mean(d))/(sd(d)/sqrt(n))
cumulative = pt(t,n-1)
pvalue = 2*(1-cumulative)
pvalue
## [1] 0.01853216
A doctor runs a controlled experiment. The participants are randomly assigned to two groups: control and treatment. The participants in the control group are given a placebo. The participants in the treatment group are given a drug.
After a month, each participant’s triglyceride level (in mg/dL) is measured. Those measurements are shown below. They can also be downloaded as a csv.
## Control: 500.1, 519.3, 525.1, 514.5, 498.8, 514, 514.3, 515.3, 525.3, 499.6, 529.8, 518.4, 570.1, 517.4, 520.6, 519.3, 514, 506.9, 520.8, 494.6, 531.4, 486.3, 524.7, 520.9
##
## Treatment: 562.4, 536, 525, 552.3, 523.9, 533.8, 539, 533.2, 528.6, 529.4, 503.2
You are asked to perform a two-tailed two-sample Welch's t-test to determine whether there is a significant difference of means between the two samples.
Determine the p-value.
Is the difference of means significant (using a significance level of 0.05)?
Yes, the drug causes a difference in average triglyceride level. / No, we don’t know whether the drug causes a difference.
To do this by hand, you first calculate the sample statistics (means, standard deviations, and the Welch-Satterthwaite degrees of freedom). You will get quite close if you round df down (floor).
Then, using a computer application or a table, you should be able to determine the following probabilities (using interpolation to estimate if using table).
You just need to use T.TEST with the proper settings. You can download the solution as a csv.
You just use t.test with the default settings. The hardest part is getting the data imported. You can do this in 2 ways: copy/paste or read.csv.
x1 = c(500.1,519.3,525.1,514.5,498.8,514,514.3,515.3,525.3,499.6,529.8,518.4,570.1,517.4,520.6,519.3,514,506.9,520.8,494.6,531.4,486.3,524.7,520.9)
x2 = c(562.4,536,525,552.3,523.9,533.8,539,533.2,528.6,529.4,503.2)
t.test(x1,x2)
##
## Welch Two Sample t-test
##
## data: x1 and x2
## t = -2.9325, df = 20.369, p-value = 0.008128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -28.422293 -4.810283
## sample estimates:
## mean of x mean of y
## 516.7292 533.3455
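Or, the long way (a sketch of the by-hand Welch computation, using the same data):

```r
# the long way: Welch's t statistic and the Welch-Satterthwaite df
x1 = c(500.1,519.3,525.1,514.5,498.8,514,514.3,515.3,525.3,499.6,529.8,518.4,570.1,517.4,520.6,519.3,514,506.9,520.8,494.6,531.4,486.3,524.7,520.9)
x2 = c(562.4,536,525,552.3,523.9,533.8,539,533.2,528.6,529.4,503.2)
v1 = var(x1)/length(x1)                  # squared standard error of mean 1
v2 = var(x2)/length(x2)                  # squared standard error of mean 2
t  = (mean(x1) - mean(x2))/sqrt(v1 + v2)
df = (v1 + v2)^2/(v1^2/(length(x1) - 1) + v2^2/(length(x2) - 1))  # Welch-Satterthwaite
pvalue = 2*pt(-abs(t), df)               # two-tailed probability
```

This reproduces the t, df, and p-value that t.test reports above.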
Download triglyceride.csv and move it to a directory (folder). Make a script, welchttest.r, and put it in the same directory. Set the working directory to this directory. Then, run the script.
### welchttest.r
data = read.csv("triglyceride.csv")
x1 = data$x1
x2 = data$x2
t.test(x1,x2)
##
## Welch Two Sample t-test
##
## data: x1 and x2
## t = -2.9325, df = 20.369, p-value = 0.008128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -28.422293 -4.810283
## sample estimates:
## mean of x mean of y
## 516.7292 533.3455
A doctor runs a controlled experiment. The sick patients are randomly assigned to two groups: control and treatment. The patients in the control group are given a placebo. The patients in the treatment group are given a drug.
After a month, each patient was checked for whether they recovered from the sickness. A “0” means no recovery while a “1” means recovery. This data can also be downloaded as a csv.
## Control: 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1
##
## Treatment: 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1
You are asked to perform a two-tailed two-proportion z-test, which is equivalent to a 2x2 chi-squared test. You will get credit whether or not you apply a continuity correction, but please pool the data for the standard-error estimate.
Just for completeness, you can also get credit for using Fisher’s exact test.
Determine the p-value.
Is the difference of proportions significant (using a significance level of 0.05)?
Yes, the drug causes a difference in recovery. / No, we don’t know whether the drug causes a difference.
There are many ways to do this problem. I will show the following:
- R's chisq.test function on a 2x2 contingency table
- R's prop.test and fisher.test functions

First, you need to determine the sample sizes (n1 and n2) and the sample totals (numbers of recoveries, ns1 and ns2). You could count by hand, but a computer is probably helpful.
It can be helpful to organize these summary statistics into a contingency table.
|  | Recover | Not_recover | TOTAL |
|---|---|---|---|
| Control | 40 | 39 | 79 |
| Treatment | 50 | 39 | 72 |
| TOTAL | 90 | 78 | 151 |
Calculate the proportions.
Determine the absolute z-score.
Using a table or an online standard-normal probability tool, determine the appropriate probabilities.
This p-value did not use the continuity correction.
Solution download.
If you wanted to make Yates’ continuity correction, you would need to calculate the EXPECTED table the same, but then make another table, where the observed values are all 0.5 closer to the expected values. Then, you would use =CHISQ.TEST on these new values (as the “observed”) and the expected values.
It looks challenging to do a Fisher exact test in a spreadsheet.
x1 = c(1,0,0,1,1,1,0,0,1,1,0,1,0,1,1,1,1,0,1,0,1,1,1,1,0,1,1,0,1,1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,0,1,0,0,1,0,1,1,0,1,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1)
x2 = c(1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,0,1,1,1,0,1,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1,0,1,1,1,0,1,1,1,1,0,1,1,0,0,1,0,1,1,1,0,1,1,0,1,0,1,1,0,1,1)
n1 = length(x1)
n2 = length(x2)
ns1 = sum(x1) #number of successes in sample 1
ns2 = sum(x2)
prop.test(c(ns1,ns2),c(n1,n2))
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(ns1, ns2) out of c(n1, n2)
## X-squared = 4.7825, df = 1, p-value = 0.02875
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.35460684 -0.02162383
## sample estimates:
## prop 1 prop 2
## 0.5063291 0.6944444
prop.test(c(ns1,ns2),c(n1,n2),correct=F)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: c(ns1, ns2) out of c(n1, n2)
## X-squared = 5.5362, df = 1, p-value = 0.01863
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.34133328 -0.03489738
## sample estimates:
## prop 1 prop 2
## 0.5063291 0.6944444
nf1 = n1-ns1
nf2 = n2-ns2
conttab = matrix(c(ns1,nf1,ns2,nf2),nrow=2)
fisher.test(conttab)
##
## Fisher's Exact Test for Count Data
##
## data: conttab
## p-value = 0.02092
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.2184936 0.9254738
## sample estimates:
## odds ratio
## 0.4537105
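As an alternative to typing the counts, you can build the contingency table directly from the 0/1 vectors with R's table function and run chisq.test on it (a sketch, using the same data vectors as above):

```r
# build the 2x2 table directly from the 0/1 vectors, then test
x1 = c(1,0,0,1,1,1,0,0,1,1,0,1,0,1,1,1,1,0,1,0,1,1,1,1,0,1,1,0,1,1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,0,1,0,0,1,0,1,1,0,1,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1)
x2 = c(1,0,1,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1,1,0,1,1,1,0,1,1,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1,0,1,1,1,0,1,1,1,1,0,1,1,0,0,1,0,1,1,1,0,1,1,0,1,0,1,1,0,1,1)
group   = rep(c("Control","Treatment"), c(length(x1), length(x2)))
conttab = table(group, c(x1, x2))   # rows: group; columns: 0 (no recovery), 1 (recovery)
chisq.test(conttab)                 # Yates-corrected; agrees with prop.test above
```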